1
Resilience: The key requirement of a [big]
[data] architecture
Let 'em know
@HuffPostCode @edwardcapriolo
2
About me

Data Architect @ Huffpo

Apache Hive
Committer/PMC

Author: Programming Hive
− 2nd edition coming. Save
up!

Husband & dad

Crazed inventor:
github.com/edwardcapriolo
3
Huffingtonpost & Me

What is huffingtonpost?
− News, blogs, and video
− Desktop and mobile
− Multiple editions worldwide

What do I do there?
− Provide APIs, dashboards, reports
− Crunch BigData using uber tech
− Say no to bad tech decisions
via 'ed says no' meme
4
For the next hour...
I am going to present
that everything I have
designed and use is
perfect and it never
breaks!
5
Reality check: Things break
all the time.

Anomalous cloud outages

External software bugs

Internal software bugs
− A.k.a. 'anomalous cloud outage' in the
post-mortem

Fat fingers

Preventable failures
6

To be resilient, design a system
that causes minimal panic
when something does break
7
What does Resilience
not sound like?

'HADOOP IS DOWN'
− “We are losing data!
Call OPS!”

“One of the 10
NoSQL nodes is
down”
− “Users are seeing
inaccurate numbers,
and requests are
failing!”

Why is this bad?
8
What does Resilience sound like?

'Hadoop is down'
− No problem. The
process loading
hadoop can queue
messages for up to
40 hours

'One of the 10
NoSQL nodes is
down'
− No Problem. We can
tolerate multiple node
failures with minimal
9
Agenda

Software stacks (especially our 'Fright stack')

Planning for building a resilient service

Redundancy

Component Overview

Case study: Building the Lifetime API

Questions
10
Data Eng Stack
at Huffpo (Fright Stack)
11
Don't be scurred!

Components are named after horror movies

Batch & Realtime aka 'Lamb Duh' architecture
− Handle the low-hanging fruit in real time
− Expensive/complex processing in batch

Designed for throughput

Designed for horizontal scale

Less is more
12
Components of streaming stack

Kafka : The strong silent type
− Persistent, distributed commit log
− Massive Throughput without GC issues at scale

Cassandra : In my Column Family
− Cells : Columns that hold values (last update wins)
− Counters : Columns that support Increment
operation

Teknek: KISS stream platform
− No Single Point of Failure
− Simple: take data off a feed, apply a function
13
Key components of batch stack

Hive / Hadoop : The big hammer
− SQL on Hadoop
− Flexibility of formats
− UDF / Streaming
− MetaStore

Impala: The scalpel
− Interactive speeds for reasonable datasets
− Avoids having to bulk load into an OLAP datastore
14
Planning a [big][data] service
15
What developers want

Hadoop

Hive

Mahout

Spark

Cascading

Python

Scala

Thrift

Storm

SQL

NoSQL

Transactions

Web
Services

Micro
Services

Message
Queue

Zookeeper

Akka

NodeJs
16
What Operations want

MySQL on RDS

[end slide]
17
What users want

Cloud

Big Data

Real Time

Reactive
− Wait I think I meant... Responsive?
18
After the initial excitement
of X, everyone:

Expects someone else to manage X

Is more excited to work with Y

Will preach to you that X is a backwards
technology holding everything back
− Even though they were a staunch advocate for X
months ago

Everyone includes you
Source: http://www.chrisunderwoodsblog.com/2014/01/new-deal-trough-or-plateau.html
19
Planning the service life-cycle

Build a playbook of setup/administration tasks

Get buy-in from multiple groups

Determine who carries out schema changes,
plans upgrades, etc.

Build monitoring and determine escalations
20
Performance demands
on the service

Acceptable performance
− Request latency
− 99th percentile
− Job time

Requests per second

Storage requirements

Acceptable caching/delay
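To make the "99th percentile" target above concrete, here is a minimal sketch of a nearest-rank percentile over request latencies. The sample values and the function name are made up for illustration; a real service would more likely read these numbers from its metrics system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 900]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99)
```

The median looks healthy while the 99th percentile exposes the slow outliers, which is why the slide lists it separately from plain request latency.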
21
Redundancy
22
23
Redundancy
24
What redundancy does for you

Less chance that a single event causes panic

Fewer manuals/wikis about what to do if...

Fewer user-facing issues

More peace of mind

Availability of N services

Active/Passive is old school
Active/Active/Active + scalable is hip
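One way to read "Availability of N services" is the standard arithmetic for independent components. The sketch below is an interpretation, not something stated on the slide: it assumes failures are independent and uses a made-up per-node availability of 99%.

```python
def serial_availability(a, n):
    """Availability of a request that needs all n independent services up."""
    return a ** n


def redundant_availability(a, n):
    """Availability when any one of n independent replicas can serve."""
    return 1 - (1 - a) ** n


# Hypothetical figure: each component is up 99% of the time.
chain = serial_availability(0.99, 3)       # three services in a row
replicas = redundant_availability(0.99, 3)  # three interchangeable replicas
print(round(chain, 6), round(replicas, 6))
```

Chaining dependencies lowers availability, while active/active replicas raise it, which is the argument for building the redundancy in from the start.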
25
Do not agile your redundancy

Be very afraid if someone tries to convince you of
anything that sounds like this:
− For MVP we do not need Namenode HA. We can
get it running now and add the HA later.
− For MVP we need to get solution X working. We
can worry about scaling it later.
− For MVP it does not have to respond quickly. We
won't have much load and can speed it up
later.
26
Later always comes
before you are ready for it
27
Component Overview
28
Criteria for software selection

Initial setup and ongoing administration

General Utility (duct tape vs star screwdriver)

'Web Scale' design effort

Customizable/pluggable

No 'at scale' gotchas

Insane specialty superpower
29
Apache Kafka

Replication: set per topic (2)

Scale: Partitions dictate clients (10, 100)

Durability: sync vs async producers

Idempotence: Messages persisted to disk

Idempotence: Messages are multiplexed

Performance: Insane throughput
30
Message Queues
without persistence

Producer might be too fast for consumers and
messages are dropped
− You would need + 100% capacity to safely deal
with all surges

Consumer crash results in dropped message
− You cannot stop for anything, not even an update,
without losing data
31
Apache Kafka
with persistence

Can handle traffic surges

Can safely queue data for upgrades
− Disk is cheap

Can replay data (bad release/backfill)

Multiplex data to multiple consumer groups
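The points above (queueing through an outage, replay, multiple consumer groups) all follow from one idea: the log keeps messages regardless of who has read them, and each consumer group only tracks its own offset. The toy model below illustrates that idea in plain Python. It is not the Kafka client API; the class, method, and group names are invented for this sketch.

```python
class PersistentLog:
    """Toy append-only log: messages stay in the log whether or not
    anyone has read them (retention is not modeled here)."""

    def __init__(self):
        self._messages = []   # stands in for on-disk log segments
        self._offsets = {}    # consumer group -> next offset to read

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def poll(self, group, max_messages=10):
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Rewind (or skip) a group, e.g. to replay after a bad release."""
        self._offsets[group] = offset


log = PersistentLog()
for event in ["view:1", "view:2", "click:1"]:
    log.append(event)

realtime = log.poll("realtime-counters")
# The hypothetical 'hadoop-loader' group was down while the events
# arrived; nothing was dropped, it simply reads them later.
batch = log.poll("hadoop-loader")

log.seek("realtime-counters", 0)   # replay from the beginning
replayed = log.poll("realtime-counters")
```

Because reading does not remove anything, a slow or stopped consumer affects only its own offset, never the producer or the other groups.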
32
Apache Cassandra

Replication: At the keyspace level (3)

Redundant: No Single Point of Failure

Durability: Self healing with quorum

Idempotence: Cell writes

Idempotence: Compare and Swap

Performance: Lightning fast writes
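As a worked example of the replication and quorum bullets: a quorum is a majority of the replicas, so with the replication factor of 3 shown above, two replicas must answer and one may be down. The helper names below are made up for this sketch.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor):
    """Replicas of one key that can be down while quorum still succeeds."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 3, 5):
    print(rf, quorum(rf), tolerated_replica_failures(rf))
```

Note that this counts replicas of a single key. In a larger cluster, several nodes can be down at once as long as no key loses more replicas than the number tolerated.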
33
Cassandra @ Work

Counters and Column Family can model a
good number of low latency stats problems

BatchMutations and stream save round trips

Clients do not need shard awareness

Masterless design ideal for high availability

To read it you have to be able to write it first
34
Apache Hadoop

Replication: Set per file (3)

Scale: Storage is incremental

Durability: Limit semantics

Performance: Typically brute force

Tuning: Too many tunes

Redundancy: Too many parts
35
Case Study: Lifetime API
36
Lifetime API

Result data per entry
− GET /api/lifetime/5656
− { "views": 45454545, "clicks": 343434 }

Provide the total lifetime sum
− Views
− Facebook shares
− Etc

Also provide 28 day counts
37
Planning

Acceptable performance
− Used in edit dashboards via web service call

Requests per second
− Hundreds to thousands

Storage requirements
− Single value for each column *

Freshness
− Update hourly
38
Previous Vertica implementation

Does some queries sick fast

Enforces primary key on read
− If you double insert, later reads fail

Query slots limiting (OLAP)

Many projections can be problematic

Updates and deletes are PITA

Stonebraker and I have beef
− http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/hadoop_is_the_best_thing
39
Let's NoSQL it!

Design for the read path
− Only fetch one entry at a time

Fetch entire history

Fetch last 28 days

Many entries have short shelf life

Do not store a single value, store a by-day
timeline instead!
40
Data modeling:
'Fixed' columns by day

Key = Entry:5555
− [2015-09-01:Views] = 30
− [2015-09-01:Clicks] = 10
− [2015-09-02:Views] = 2

Sparse data

Ordered by time
− Allows us to efficiently ask for ranges of data
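The reason time-ordered column names allow efficient range requests can be sketched with a sorted list and binary search. This is a toy in-memory model of one wide row, assuming column names sort as plain strings; it is not Cassandra client code, and the class and method names are invented.

```python
import bisect


class Row:
    """Toy wide row: column names are kept sorted, so a date range
    is one contiguous slice."""

    def __init__(self):
        self._names = []
        self._values = {}

    def put(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value  # last update wins

    def slice(self, start, end):
        """Columns with start <= name < end, in sorted order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_left(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


row = Row()  # stands in for Key = Entry:5555
row.put("2015-09-02:Views", 2)
row.put("2015-09-01:Views", 30)
row.put("2015-09-01:Clicks", 10)

sept_first = row.slice("2015-09-01", "2015-09-02")
print(sept_first)
```

Fetching the last 28 days is the same call with a wider start and end, and sparse data costs nothing because days without columns simply do not appear in the slice.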
41
Data modeling:
Dealing with $hipforaday
social networks

Key = Entry:5555
− [2015-09-01:networks/zintrest/zshares] = 22
− [2015-09-01:networks/zintrest/zlikes] = 10
− [2015-09-01:networks/dug/dougs] = 2

Two level dynamic:
− network/type
− True old schoolers mash strings bra

Schema-less is eloquent with the social
networks!
− Explain ire of schema and social networks
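"Mashing strings" for the two-level network/type names can be sketched as a pair of helpers. The function names are hypothetical, and the sketch assumes network and metric names never contain ':' or '/'.

```python
def column_name(day, network, metric):
    """Build a column name such as 2015-09-01:networks/zintrest/zshares."""
    return f"{day}:networks/{network}/{metric}"


def parse_column(name):
    """Split a column name back into (day, network, metric)."""
    day, path = name.split(":", 1)
    _, network, metric = path.split("/")
    return day, network, metric


name = column_name("2015-09-01", "zintrest", "zshares")
print(name, parse_column(name))
```

A new social network needs no schema change here: the loader just starts writing columns with a new network segment.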
42
Updating hourly

Had hourly updates descoped until:

Asks 'Do we update hourly?'
− Of course, someone says yes
43
Data Modeling: Multiple granularity
in same row with TTL

Compute daily data once a day

Hourly data with time-to-live during the day

Entry:5555
− [2015-09-01-01:Views] = 30 *ttl 24 hours
− [2015-09-01-01:Clicks] = 10 *ttl 24 hours
− [2015-09-01-02:Views] = 2 *ttl 24 hours

API needs some intelligence not to count
hourly data if the daily column exists
− Could have named these columns so that they
always appear at the beginning or end of the data
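The "intelligence" the API needs can be sketched as follows: sum the daily columns, and add hourly columns only for days that do not yet have a daily column. This assumes the column formats shown above (YYYY-MM-DD:Metric for daily, YYYY-MM-DD-HH:Metric for hourly). The function name and sample values are made up, and TTL expiry itself is not modeled.

```python
def lifetime_total(columns, metric):
    """Sum a metric across a row, skipping hourly columns for any day
    that already has a daily column."""
    daily, hourly = {}, {}
    for name, value in columns.items():
        stamp, col_metric = name.split(":", 1)
        if col_metric != metric:
            continue
        if len(stamp) == len("2015-09-01"):   # daily: YYYY-MM-DD
            daily[stamp] = value
        else:                                  # hourly: YYYY-MM-DD-HH
            day = stamp[:10]
            hourly[day] = hourly.get(day, 0) + value
    total = sum(daily.values())
    total += sum(v for day, v in hourly.items() if day not in daily)
    return total


columns = {
    "2015-09-01:Views": 30,
    "2015-09-01-01:Views": 30,   # not yet expired; must not be counted twice
    "2015-09-02-01:Views": 2,
    "2015-09-02-02:Views": 5,
    "2015-09-02-01:Clicks": 1,
}

views = lifetime_total(columns, "Views")
print(views)
```

Without the check, the window between writing the daily column and the hourly columns expiring would double count that day.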
44
Compute in batch write to NoSQL

Hive Queries from
scheduler produce
hourly data

Hour data
aggregated into day
table

TheRing: HCat API
[table] -> Cassandra
45
Results

Entry data divided
evenly across cluster

Survive multiple node
failures

API sums data on
read path

Horizontally scalable
http://sparkletechthoughts.blogspot.com/2013/03/how-to-setup-cassandra-cluster-using.html
46
How Resilient is this service?

Hourly/Daily processing can easily be re-run

Bulk loading cells is idempotent

NoSQL (Cassandra) has fault tolerance

NoSQL can take massive load

API server is stateless and easily load balanced
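Why re-running a load is safe for cells can be shown by contrast with counters. The sketch below models both in plain dictionaries; it is an illustration of last-update-wins semantics versus increments, not Cassandra client code, and all names and values are invented.

```python
cells = {}      # (key, column) -> (value, timestamp)
counters = {}   # (key, column) -> running total


def write_cell(key, column, value, timestamp):
    """Last update wins: keep the value with the newest timestamp."""
    current = cells.get((key, column))
    if current is None or timestamp >= current[1]:
        cells[(key, column)] = (value, timestamp)


def increment(key, column, delta):
    """Counter increment: every call changes the stored total."""
    counters[(key, column)] = counters.get((key, column), 0) + delta


def load_batch():
    write_cell("Entry:5555", "2015-09-01:Views", 30, timestamp=1000)
    increment("Entry:5555", "2015-09-01:Views", 30)


load_batch()
load_batch()   # the job is re-run after a failure

print(cells[("Entry:5555", "2015-09-01:Views")])
print(counters[("Entry:5555", "2015-09-01:Views")])
```

After the re-run the cell still holds 30 while the counter has doubled to 60, which is why the batch path writes computed totals as cells rather than replaying increments.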
47
? Questions ?

Resilience: the key requirement of a [big] [data] architecture - StampedeCon 2015

  • 1. 1 Resilience: The key requirement of a [big] [data] architecture Let 'em know @HuffPostCode @edwardcapriolo
  • 2. 2 About me  Data Architect @ Huffpo  Apache Hive Committer/PMC  Author: Programming Hive − 2nd edition coming. Save up!  Husband & dad  Crazed inventor: github.com/edwardcapriolo
  • 3. 3 Huffingtonpost & Me  What is huffingtonpost? − News, blogs, and video − Desktop and mobile − Multiple editions worldwide  What do I do there? − Provide APIs, dashboards, reports − Crunch BigData using uber tech − Say no to bad tech decisions via 'ed says no' meme
  • 4. 4 For the next hour... I am going to present that everything I have designed and use is perfect and it never breaks!
  • 5. 5 Reality check: Things break all the time.  Anomalous cloud outages  External software bugs  Internal software bugs − Aka. Anomalous cloud outage in post mortem  Fat fingers  Preventable failures
  • 6. 6  To be resilient, design a system that causes minimal panic when something does break
  • 7. 7 What does Resilience not sound like?  'HADOOP IS DOWN' − “We are losing data! Call OPS!”  “One of the 10 NoSQL nodes is down” − “Users are seeing inaccurate numbers, and requests are failing!”  Why is this bad?
  • 8. 8 What does Resilience sound like?  'Hadoop is down' − No problem. The process loading hadoop can queue messages for up to 40 hours  'One of the 10 NoSQL nodes is down' − No Problem. We can tolerate multiple node failures with minimal
  • 9. 9 Agenda  Software stacks (especially our 'Fright stack')  Planning for building a resilient service  Redundancy  Component Overview  Case study: Building the Lifetime API  Questions
  • 10. 10 Data Eng Stack at Huffpo (Fright Stack)
  • 11. 11 Dont be scurred!  Components are named after horror movies  Batch & Realtime aka 'Lamb Duh' architecture − Accomplish lower hanging fruit in real-time − Expensive/complex processing in batch  Designed for throughput  Designed for horizontal scale  Less is more
  • 12. 12 Components of streaming stack  Kafka : The strong silent type − Persistent, distributed commit log − Massive throughput without GC issues at scale  Cassandra : In my Column Family − Cells : Columns that hold values (last update wins) − Counters : Columns that support the increment operation  Teknek: KISS stream platform − No Single Point of Failure − Simple: take data off a feed, apply a function
  • 13. 13 Key components of batch stack  Hive / Hadoop : The big hammer − SQL on Hadoop − Flexibility of formats − UDF / Streaming − MetaStore  Impala: The scalpel − Interactive speeds for reasonable datasets − Avoids having to bulk load into an OLAP datastore
  • 16. 16 What Operations want  MySQL on RDS  [end slide]
  • 17. 17 What users want  Cloud  Big Data  Real Time  Reactive − Wait I think I meant... Responsive?
  • 18. 18 After the initial excitement of X, everyone:  Expects someone else to manage X  Is more excited to work with Y  Will preach to you that X is a backwards technology holding everything back − Even though they were a staunch advocate for X months ago  Everyone includes you Source: http://www.chrisunderwoodsblog.com/2014/01/new-deal-trough-or-plateau.html
  • 19. 19 Planning the service life-cycle  Build a playbook of setup/administration tasks  Get multiple groups buy in  Determine who carries out schema changes, planning upgrades, etc  Build monitoring and determine escalations
  • 20. 20 Performance demands on the service  Acceptable performance − Request latency − 99th percentile − Job time  Requests per second  Storage requirements  Acceptable caching/delay
  • 22. 22
  • 24. 24 What redundancy does for you  Less chance that a single event causes panic  Fewer manuals/wikis about what to do if...  Fewer user-facing issues  More peace of mind  Availability of N services  Active/Passive is old school; Active/Active/Active + scalable is hip
  • 25. 25 Do not agile your redundancy  Be very afraid if someone tries to convince you of anything that sounds like this: − For MVP we do not need Namenode HA. We can get it running now and add the HA later. − For MVP we need to get solution X working. We can worry about scaling it later. − For MVP it does not have to respond quickly. We won't have much load and can speed it up later.
  • 26. 26 Later always comes before you are ready for it
  • 28. 28 Criteria for software selection  Initial setup and ongoing administration  General Utility (duct tape vs star screwdriver)  'Web Scale' design effort  Customizable/pluggable  No 'at scale' gotchas  Insane specialty superpower
  • 29. 29 Apache Kafka  Replication: set per topic (2)  Scale: Partitions dictate clients (10, 100)  Durability: sync vs async producers  Idempotence: Messages persisted to disk  Idempotence: Messages are multiplexed  Performance: Insane throughput
  • 30. 30 Message Queues without persistence  Producer might be too fast for consumers and messages are dropped − You would need +100% capacity to safely deal with all surges  Consumer crash results in dropped messages − You cannot stop for anything, not even an update, without losing data
  • 31. 31 Apache Kafka with persistence  Can handle traffic surges  Can safely queue data for upgrades − Disk is cheap  Can replay data (bad release/backfill)  Multiplex data to multiple consumer groups
  • 32. 32 Apache Cassandra  Replication: At the keyspace level (3)  Redundant: No Single Point of Failure  Durability: Self healing with quorum  Idempotence: Cell writes  Idempotence: Compare and Swap  Performance: Lightning fast writes
  • 33. 33 Cassandra @ Work  Counters and Column Family can model a good number of low latency stats problems  BatchMutations and stream save round trips  Clients do not need shard awareness  Masterless design ideal for high availability  To read it you have to be able to write it first
  • 34. 34 Apache Hadoop  Replication: Set per file (3)  Scale: Storage is incremental  Durability: Limit semantics  Performance: Typically brute force  Tuning: Too many tunes  Redundancy: Too many parts
  • 36. 36 Lifetime API  Result data per entry − GET /api/lifetime/5656 − { "views": 45454545, "clicks": 343434 }  Provide the total lifetime sum − Views − Facebook shares − Etc  Also provide 28 day counts
  • 37. 37 Planning  Acceptable performance − Used in edit dashboards via web service call  Request per second − Hundreds to thousands  Storage requirements − Single value for each column *  Freshness − Update hourly
  • 38. 38 Previous Vertica implementation  Does some queries sick fast  Enforces primary key on read − If you double insert, later reads fail  Query slots limiting (OLAP)  Many projections can be problematic  Updates and deletes are PITA  Stonebraker and I have beef − http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/hadoop_is_the_best_thing
  • 39. 39 Let's NoSQL it!  Design for the read path − Only fetch one entry at a time  Fetch entire history  Fetch last 28 days  Many entries have short shelf life  Do not store a single value, store a by-day timeline instead!
  • 40. 40 Data modeling: 'Fixed' columns by day  Key = Entry:5555 − [2015-09-01:Views] = 30 − [2015-09-01:Clicks] = 10 − [2015-09-02:Views] = 2  Sparse data  Ordered by time − Allows us to efficiently ask for ranges of data
  • 41. 41 Data modeling: Dealing with $hipforaday social networks  Key = Entry:5555 − [2015-09-01:networks/zintrest/zshares] = 22 − [2015-09-01:networks/zintrest/zlikes] = 10 − [2015-09-01:networks/dug/dougs] = 2  Two level dynamic: − network/type − True old schoolers mash strings bra  Schema-less is eloquent with the social networks! − Explain ire of schema and social networks
  • 42. 42 Updating hourly  Had hourly updates descoped until:  Asks 'Do we update hourly?' − Of course, someone says yes
  • 43. 43 Data Modeling: Multiple granularity in same row with TTL  Compute daily data once a day  Hourly data with time-to-live during the day  Entry:5555 − [2015-09-01-01:Views] = 30 *ttl 24 hours − [2015-09-01-01:Clicks] = 10 *ttl 24 hours − [2015-09-01-02:Views] = 2 *ttl 24 hours  API needs some intelligence not to count hourly data if the daily column exists − Could have named these columns so that they always appear at the beginning or end of the data
  • 44. 44 Compute in batch write to NoSQL  Hive Queries from scheduler produce hourly data  Hour data aggregated into day table  TheRing: HCat API [table] -> Cassandra
  • 45. 45 Results  Entry data divided evenly across cluster  Survive multiple node failures  API sums data on read path  Horizontally scalable Source: http://sparkletechthoughts.blogspot.com/2013/03/how-to-setup-cassandra-cluster-using.html
  • 46. 46 How Resilient is this service?  Hourly/Daily processing can easily be re-run  Bulk loading cells is idempotent  NoSQL (Cassandra) has fault tolerance  NoSQL can take massive load  API server is stateless and easily load balanced
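
The read path described in slides 39 to 45 (one row per entry, by-day columns, short-lived hourly columns, and an API that sums on read) can be sketched roughly as follows. This is a minimal illustration, not the talk's actual code: the in-memory dict stands in for a Cassandra row, and the names `ROW`, `parse_column`, `sum_row`, and `last_28_days` are hypothetical. It assumes column names of the form `YYYY-MM-DD:Metric` for daily data and `YYYY-MM-DD-HH:Metric` for hourly data, as shown on the slides.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical in-memory stand-in for one Cassandra row (key "Entry:5555").
# Daily columns are "YYYY-MM-DD:Metric"; hourly columns are
# "YYYY-MM-DD-HH:Metric" and would carry a 24 hour TTL in the real store.
ROW = {
    "2015-09-01:Views": 30,
    "2015-09-01:Clicks": 10,
    "2015-09-01-23:Views": 7,   # hourly cell not yet expired; daily column covers it
    "2015-09-02-01:Views": 2,
    "2015-09-02-02:Views": 5,
}


def parse_column(name):
    """Split a column name into (day, is_hourly, metric)."""
    stamp, metric = name.split(":", 1)
    day = stamp[:10]               # "YYYY-MM-DD"
    is_hourly = len(stamp) > 10    # hourly stamps have a trailing "-HH"
    return day, is_hourly, metric


def sum_row(row, since=None):
    """Sum each metric on the read path, optionally only for days >= since."""
    daily_seen = {
        (day, metric)
        for day, is_hourly, metric in map(parse_column, row)
        if not is_hourly
    }
    totals = defaultdict(int)
    for name, value in row.items():
        day, is_hourly, metric = parse_column(name)
        # ISO dates compare correctly as strings.
        if since is not None and day < since.isoformat():
            continue
        # Do not count hourly cells once the daily column for that day exists.
        if is_hourly and (day, metric) in daily_seen:
            continue
        totals[metric] += value
    return dict(totals)


def last_28_days(row, today):
    # Assumption: "28 days" means today plus the 27 days before it.
    return sum_row(row, since=today - timedelta(days=27))


print(sum_row(ROW))                           # lifetime totals
print(last_28_days(ROW, date(2015, 9, 29)))   # rolling window
```

Because the totals are recomputed from cells on every read, re-running a batch load that rewrites the same cells with the same values leaves the result unchanged, which is the idempotence property slide 46 relies on.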