The Incremental Path to Observability
@eanakashima
about://
Emily Nakashima
dir. engineering @ honeycomb.io
Observability for Production Systems:
debug with Events, Traces, and Logs in real time.
Instrumenting your code so that
you can deeply understand the
state of your system just by
observing its outputs
Observability
How you collect the data you
need to achieve observability —
usually code you write or
include in your applications and
services
Instrumentation
That sounds like
a lot of work!
Why trust you?
Isn’t this…
your job?
We’re a tiny, early startup
We have to work on building the product
as fast as we can — not instrumentation
We’re a tiny, early startup
~3% of initial project == instrumentation
~1% of ongoing work == instrumentation
Toshok
Christine Charity
Ben
Ginsu
The Incremental Path
1. Metrics & Logs
1a. Structured Logs
2. Events
3. Traces
4. Gotchas & Next Steps
1. Metrics & Logs
Metrics & Logs
• Metrics: SignalFx metrics SaaS (could have been Datadog, Wavefront, Prometheus, etc.)
• We still get ~1-2 useful metrics-based alerts per week
• Logs: just regular log files written to disk on EC2 (no log aggregation SaaS)
• Logs are a final debugging step, not a starting one
“SSH’ing into any server is an
antipattern, a failure”
- Charity Majors
Metrics & Logs
• How: the lowest-effort way you won’t regret later
• Why: don’t fly blind
• What you get: host-based & aggregated alerts; logs of wtf happened
• When you’re ready for the next step: high-cardinality questions your
metrics & logs can’t quickly answer
High-cardinality questions your metrics & logs can’t answer
What
happened
here?
1a. Structured Logs
• Key-value pairs (JSON)
• Self-describing
• Human-readable, just like traditional logs
https://www.honeycomb.io/blog/you-could-have-invented-structured-logging/
• A good transitional step to using Events. Send to your Events backend.
Structured Logs: almost Events
Events
• Can look like structured logs; might even be structured logs!
• Key-value pairs (JSON)
• Represent a complete unit of work for your system
2. Events
https://www.honeycomb.io/blog/how-are-structured-logs-different-from-events/
What is an event?
1 event == 1 unit of work ~= 1 request
{
  "Host": "127.0.0.1:8080",
  "IsXHR": true,
  "Method": "POST",
  "RequestURI": "/user_event/page-unload",
  "ResponseContentLength": 443,
  "ResponseHttpStatus": 200,
  "ResponseTime_ms": 123,
  "Timestamp": "2018-03-02T06:14:57.206349701Z",
  "UserEmail": "nathan@honeycomb.io",
  "UserID": 18,
  "availability_zone": "us-east-1b",
  "build_id": "6552",
  "env": "dogfood",
  "infra_type": "aws_instance",
  "instance_type": "t2.micro",
  "memory_inuse": 15450056,
  "num_goroutines": 56,
  "request_id": "poodle-a38f5e39/5fIUGkX5D1-001814",
  "server_hostname": "poodle-a38f5e39",
  "type": "request"
}
High cardinality
Fields that may have many unique values
Common examples:
• email address
• username / user id / team id
• server hostname
• IP address
• user agent string
• build id
• request url
• feature flags / flag combinations
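To make "many unique values" concrete, the sketch below counts the distinct values of a field across a handful of events. The `cardinality` helper and the sample data are made up for illustration.

```go
package main

import "fmt"

// cardinality returns how many distinct values a field takes across events.
// Values are keyed by type and printed form so unhashable values cannot panic.
// (Hypothetical helper for illustration.)
func cardinality(events []map[string]interface{}, field string) int {
	seen := make(map[string]struct{})
	for _, ev := range events {
		if v, ok := ev[field]; ok {
			seen[fmt.Sprintf("%T:%v", v, v)] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	events := []map[string]interface{}{
		{"Method": "GET", "UserID": 18},
		{"Method": "POST", "UserID": 19},
		{"Method": "GET", "UserID": 20},
	}
	// Method stays small no matter how much traffic arrives;
	// UserID grows with the number of users.
	fmt.Println("Method:", cardinality(events, "Method"))
	fmt.Println("UserID:", cardinality(events, "UserID"))
}
```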
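	// ... (the start of eventFromRequest is not shown on the slide)
	if ok {
		ev.AddField("user.id", userID)
	}
	return ev
}

// signupHandlerGet renders the signup page and builds an event for the request.
func signupHandlerGet(w http.ResponseWriter, r *http.Request) {
	var err error
	ev := eventFromRequest(r)
	// Deferred so the event sees the final value of err when the handler returns.
	defer addFinalErr(&err, ev)
	// Parse the template first; a joined path on its own has no Execute method.
	tmpl := template.Must(template.ParseFiles(filepath.Join(templatesDir, "signup.html")))
	tmplData := struct {
		ErrorMessage string
	}{}
	if err = tmpl.Execute(w, tmplData); err != nil {
		log.Print(err)
	}
}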
See https://github.com/honeycombio/examples for more
Writing event
instrumentation code:
Not scary!
Where do I start instrumenting?
Start at the edge.
Instrument the surfaces
customers touch first.
Honeycomb
Architecture
“Nines don’t matter if users
aren’t happy.”
- Charity Majors, all the time
Storytime: the mysterious metrics spike
• Best: dedicated Event tool (Honeycomb, New Relic Insights, etc.)
• Good: tracing tool, sending just root spans (Honeycomb, Jaeger, Lightstep…)
• Ok: log aggregation & analytics tool (if it’s blazing fast)
• ¯\_(ツ)_/¯: metrics tool (if it has high-cardinality support for many tags)
Tool Choice
Events
• How: bit by bit, starting from the edge
• Why: understand faster
• What you get: context, high-cardinality support
• When you’re ready for the next step: you’re spending lots of time on
backend perf issues, you have complex multi-service bugs, or your team no
longer has a complete mental model of how your services talk to each other
Tracing
Honeycomb
Architecture
Events vs. Traces
• Traces require fields like parent_id, span_id, trace_id, duration
• Traces are composed of many Events — usually called Spans when they
are part of a larger Trace
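A small sketch of how those fields turn a flat list of events into a trace: each span points at its parent, and the tree falls out of grouping by parent ID. The `Span` type and `render` function are hypothetical, and the sketch assumes span IDs are non-empty and unique and that the root span has an empty parent ID.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is an event carrying the extra fields a trace needs.
// (Hypothetical type; the fields mirror trace_id, span_id, parent_id, duration.)
type Span struct {
	TraceID    string
	SpanID     string
	ParentID   string // empty for the root span
	Name       string
	DurationMs float64
}

// render groups spans by parent ID and returns one indented line per span,
// walking the tree depth-first from the root.
func render(spans []Span) []string {
	children := make(map[string][]Span)
	for _, s := range spans {
		children[s.ParentID] = append(children[s.ParentID], s)
	}
	var lines []string
	var walk func(parentID string, depth int)
	walk = func(parentID string, depth int) {
		for _, s := range children[parentID] {
			indent := strings.Repeat("  ", depth)
			lines = append(lines, fmt.Sprintf("%s%s (%.1fms)", indent, s.Name, s.DurationMs))
			walk(s.SpanID, depth+1)
		}
	}
	walk("", 0)
	return lines
}

func main() {
	spans := []Span{
		{TraceID: "t1", SpanID: "a", Name: "editHandler", DurationMs: 12},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Name: "loadPage", DurationMs: 7.5},
		{TraceID: "t1", SpanID: "c", ParentID: "a", Name: "renderTemplate", DurationMs: 3},
	}
	for _, line := range render(spans) {
		fmt.Println(line)
	}
}
```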
Instrumentation choices
• Better turnkey instrumentation (db & http wrappers, etc.):
• Vendor agent for an APM product (New Relic, Datadog)
• Vendor integration for a tracing product (Honeycomb)
• More open:
• OpenCensus / OpenTracing (merging)
	// ... (the start of sendSpan is not shown on the slide)
	metadata["durationMs"] = float64(time.Since(start)) / float64(time.Millisecond)
	if parentID := parentIDFromContext(ctx); parentID != "" {
		metadata["parentId"] = parentID
	}
	ev := libhoney.NewEvent()
	ev.Timestamp = start
	ev.Add(metadata)
	ev.Send()
}

func editHandler(w http.ResponseWriter, r *http.Request, title string) {
	// Span for loading the page.
	loadPageID := newID()
	loadPageStart := time.Now()
	p, err := loadPage(newContextWithParentID(r.Context(), loadPageID), title)
	sendSpan("loadPage", loadPageID, loadPageStart, r.Context(), map[string]interface{}{"title": title, "error": err})
	if err != nil {
		p = &Page{Title: title}
	}

	// Span for rendering the template.
	renderID := newID()
	renderStart := time.Now()
	renderTemplate(newContextWithParentID(r.Context(), renderID), w, "edit", p)
	sendSpan("renderTemplate", renderID, renderStart, r.Context(), map[string]interface{}{"template": "edit"})
}
See https://github.com/honeycombio/examples for more
Writing tracing
instrumentation code:
Somewhat scary!
Take it a span at a time
Switching our web app to traces
• 1 week hackathon for ~2 people to get the big plumbing in
• 3 months of 2-3 engineers adding context in our spare time (yikes)
• Ongoing instrumentation improvements as we need them, an hour or two
per month
Storytime: the stampede
// SaveToCache caches the dataset by ID
func (d *Dataset) SaveToCache(ctx context.Context, cacheService cache.Service, mysqlService mysql.Service) {
	cacheKey := datasetCacheKey(d.ID)
	cacheService.SetWithTTL(ctx, cacheKey, d, datasetCacheTTL, func(innerCtx context.Context, key string, value interface{}) (interface{}, error) {
		return FindDatasetByID(innerCtx, mysqlService, d.ID)
	})
}
Tracing
• How: start at your hot spots
• Why: dive deeper
• What you get: faster perf & distributed systems debugging
• When you’re ready for the next step: lol no, you’ll be working on tracing
instrumentation forever
How not to shoot your own foot
Sampling
• Tracing sends much more data. Don’t surprise your CFO with a double-digit percent increase in your AWS bill like I did!
• Select & send representative events, not all events.
• Use the dynsampler library or write your own logic
• Be sure to set the `sample_rate` field on your events
Canopy paper: https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
Ben’s Sampling talk: https://www.usenix.org/conference/lisa17/conference-program/presentation/hartshorne
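A rough sketch of the "write your own logic" option: keep 1 in N events per key and record N as `sample_rate`, so the backend can re-weight counts. The `Sampler` type and `estimateTotal` helper are hypothetical and are not the dynsampler API. A counter is used instead of randomness to keep the example reproducible, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// Sampler keeps 1 in N events per key.
// (Hypothetical helper for illustration; not safe for concurrent use.)
type Sampler struct {
	Rates       map[string]int // per-key rate: keep 1 in N
	DefaultRate int
	seen        map[string]int
}

func NewSampler(defaultRate int, rates map[string]int) *Sampler {
	return &Sampler{Rates: rates, DefaultRate: defaultRate, seen: make(map[string]int)}
}

// Sample reports whether to keep this event and the sample rate to record on it.
func (s *Sampler) Sample(key string) (bool, int) {
	rate, ok := s.Rates[key]
	if !ok {
		rate = s.DefaultRate
	}
	if rate < 1 {
		rate = 1
	}
	s.seen[key]++
	return s.seen[key]%rate == 0, rate
}

// estimateTotal re-weights kept events: each one stands for sample_rate originals.
func estimateTotal(kept []map[string]interface{}) int {
	total := 0
	for _, ev := range kept {
		if r, ok := ev["sample_rate"].(int); ok {
			total += r
		}
	}
	return total
}

func main() {
	// Keep every error, but only 1 in 10 successful requests.
	s := NewSampler(10, map[string]int{"500": 1})
	var kept []map[string]interface{}
	sent := 0
	for i := 0; i < 105; i++ {
		status := "200"
		if i%21 == 0 {
			status = "500"
		}
		sent++
		if keep, rate := s.Sample(status); keep {
			kept = append(kept, map[string]interface{}{"status": status, "sample_rate": rate})
		}
	}
	fmt.Printf("sent %d, kept %d, estimated total %d\n", sent, len(kept), estimateTotal(kept))
}
```

Keying the rate on something like status code is one way to keep rare, interesting events (errors) while thinning out the common ones.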
Schema churn
The best schema:
• Is consistent between services
• Plans ahead for field names (follow OpenCensus field conventions?)
• Uses namespacing!
Internal Resistance
Start in !prod
Use events or traces to debug something low-risk and high-value, like:
• build pipeline — find bottlenecks
• test suite — quantify your test flakiness
Build pipeline: find bottlenecks
Thank you!
@eanakashima
honeycomb.io

Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

The Incremental Path to Observability

  • 1. The Incremental Path to Observability @eanakashima
  • 5. Observability for Production Systems: debug with Events, Traces, and Logs in real time.
  • 6. Observability: Instrumenting your code so that you can deeply understand the state of your system just by observing its outputs
  • 9. Instrumentation: How you collect the data you need to achieve observability, usually code you write or include in your applications and services
  • 11. Why trust you? Isn’t this… your job?
  • 12. We’re a tiny, early startup: we have to work on building the product as fast as we can, not instrumentation
  • 13. We’re a tiny, early startup
      • ~3% of initial project == instrumentation
      • ~1% of ongoing work == instrumentation
  • 14. Toshok, Christine, Charity, Ben, Ginsu
  • 15. The Incremental Path
      1. Metrics & Logs
      1a. Structured Logs
      2. Events
      3. Traces
      4. Gotchas & Next Steps
  • 17. 1. Metrics & Logs
  • 18. Metrics & Logs
      • Metrics: signalFX metrics SaaS (could have been Datadog, Wavefront, Prometheus, etc.)
      • We still get ~1-2 useful metrics-based alerts per week
      • Logs: just regular log files written to disk on EC2 (no log aggregation SaaS)
      • Logs are a final debugging step, not a starting one
  • 19. “SSH’ing into any server is an antipattern, a failure” - Charity Majors
  • 20. Metrics & Logs
      • How: the lowest-effort way you won’t regret later
      • Why: don’t fly blind
      • What you get: host-based & aggregated alerts; logs of wtf happened
      • When you’re ready for the next step: high-cardinality questions your metrics & logs can’t quickly answer
  • 21. High-cardinality questions your metrics & logs can’t answer: What happened here?
  • 23. 1a. Structured Logs
      • Key-value pairs (JSON)
      • Self-describing
      • Human-readable, just like traditional logs
      https://www.honeycomb.io/blog/you-could-have-invented-structured-logging/
  • 25. 1a. Structured Logs
      • Key-value pairs (JSON)
      • Self-describing
      • Human-readable, just like traditional logs
      • A good transitional step to using Events. Send to your Events backend.
  • 29. 2. Events
      • Can look like structured logs; might even be structured logs!
      • Key-value pairs (JSON)
      • Represent a complete unit of work for your system
      https://www.honeycomb.io/blog/how-are-structured-logs-different-from-events/
  • 30. What is an event? 1 event == 1 unit of work ~= 1 request
  • 31.
        {
          "Host": "127.0.0.1:8080",
          "IsXHR": true,
          "Method": "POST",
          "RequestURI": "/user_event/page-unload",
          "ResponseContentLength": 443,
          "ResponseHttpStatus": 200,
          "ResponseTime_ms": 123,
          "Timestamp": "2018-03-02T06:14:57.206349701Z",
          "UserEmail": "nathan@honeycomb.io",
          "UserID": 18,
          "availability_zone": "us-east-1b",
          "build_id": "6552",
          "env": "dogfood",
          "infra_type": "aws_instance",
          "instance_type": "t2.micro",
          "memory_inuse": 15450056,
          "num_goroutines": 56,
          "request_id": "poodle-a38f5e39/5fIUGkX5D1-001814",
          "server_hostname": "poodle-a38f5e39",
          "type": "request"
        }
  • 32. High cardinality: fields that may have many unique values. Common examples:
      • email address
      • username / user id / team id
      • server hostname
      • IP address
      • user agent string
      • build id
      • request url
      • feature flags / flag combinations
  • 33.
            if ok {
                ev.AddField("user.id", userID)
            }
            return ev
        }
        func signupHandlerGet(w http.ResponseWriter, r *http.Request) {
            var err error
            // Build one event for this request; any final error is recorded when the handler returns.
            ev := eventFromRequest(r)
            defer addFinalErr(&err, ev)
            // Parse the signup template.
            tmpl := template.Must(template.ParseFiles(filepath.Join(templatesDir, "signup.html")))
            tmplData := struct {
                ErrorMessage string
            }{}
            if err = tmpl.Execute(w, tmplData); err != nil {
                log.Print(err)
            }
        }
      See https://github.com/honeycombio/examples for more
  • 35. Where do I start instrumenting? Start at the edge. Instrument the surfaces customers touch first.
  • 37. “Nines don’t matter if users aren’t happy.” - Charity Majors, all the time
  • 38. Storytime: the mysterious metrics spike
  • 46. Tool Choice
      • Best: dedicated Event tool (Honeycomb, NewRelic Insights, etc.)
      • Good: tracing tool, just send root spans (Honeycomb, Jaeger, Lightstep…)
      • Ok: log aggregation & analytics tool (if it’s blazing fast)
      • ¯\_(ツ)_/¯: metrics tool (if it has high-cardinality support for many tags)
  • 47. Events
      • How: bit by bit, starting from the edge
      • Why: understand faster
      • What you get: context, high-cardinality support
      • When you’re ready for the next step: you’re spending lots of time on backend perf issues, you have complex multi-service bugs, or your team no longer has a complete mental model of how your services talk to each other
  • 50. Events vs. Traces
      • Traces require fields like parent_id, span_id, trace_id, duration
      • Traces are composed of many Events, usually called Spans when they are part of a larger Trace
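As a rough illustration of why those extra fields matter, the sketch below shows how `parent_id` and `span_id` are enough to rebuild a trace as a tree from a flat list of events. The `Span` type, the `name` and `duration_ms` fields, and both helper functions are assumptions made for this example; they are not taken from the talk or from any tracing library.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is an event plus the fields that place it inside a trace.
// Field names follow the slide; Name and DurationMs are assumed for illustration.
type Span struct {
	TraceID    string  `json:"trace_id"`
	SpanID     string  `json:"span_id"`
	ParentID   string  `json:"parent_id"` // empty for the root span
	Name       string  `json:"name"`
	DurationMs float64 `json:"duration_ms"`
}

// childrenByParent indexes the spans of one trace by their parent_id.
func childrenByParent(spans []Span) map[string][]Span {
	idx := make(map[string][]Span)
	for _, s := range spans {
		idx[s.ParentID] = append(idx[s.ParentID], s)
	}
	return idx
}

// renderTree walks the index depth-first and returns one indented line per span.
func renderTree(idx map[string][]Span, parentID string, depth int) []string {
	var lines []string
	for _, s := range idx[parentID] {
		lines = append(lines, fmt.Sprintf("%s%s (%.0fms)", strings.Repeat("  ", depth), s.Name, s.DurationMs))
		lines = append(lines, renderTree(idx, s.SpanID, depth+1)...)
	}
	return lines
}

func main() {
	spans := []Span{
		{TraceID: "t1", SpanID: "a", Name: "editHandler", DurationMs: 50},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Name: "loadPage", DurationMs: 40},
		{TraceID: "t1", SpanID: "c", ParentID: "a", Name: "renderTemplate", DurationMs: 8},
	}
	for _, line := range renderTree(childrenByParent(spans), "", 0) {
		fmt.Println(line)
	}
}
```

Each span is still an ordinary event, so a backend that only understands events can store it; the tree view is something a tracing tool derives from the IDs.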
  • 52. Instrumentation choices
      • Better turnkey instrumentation (db & http wrappers, etc.):
          • Vendor agent for an APM product (New Relic, Datadog)
          • Vendor integration for a tracing product (Honeycomb)
      • More open:
          • OpenCensus / OpenTracing (merging)
  • 53.
            metadata["durationMs"] = float64(time.Since(start)) / float64(time.Millisecond)
            if parentID := parentIDFromContext(ctx); parentID != "" {
                metadata["parentId"] = parentID
            }
            ev := libhoney.NewEvent()
            ev.Timestamp = start
            ev.Add(metadata)
            ev.Send()
        }
        func editHandler(w http.ResponseWriter, r *http.Request, title string) {
            loadPageID := newID()
            loadPageStart := time.Now()
            p, err := loadPage(newContextWithParentID(r.Context(), loadPageID), title)
            sendSpan("loadPage", loadPageID, loadPageStart, r.Context(), map[string]interface{}{"title": title, "error": err})
            if err != nil {
                p = &Page{Title: title}
            }
            renderID := newID()
            renderStart := time.Now()
            renderTemplate(newContextWithParentID(r.Context(), renderID), w, "edit", p)
            sendSpan("renderTemplate", renderID, renderStart, r.Context(), map[string]interface{}{"template": "edit"})
        }
      See https://github.com/honeycombio/examples for more
  • 54. Writing tracing instrumentation code: Somewhat scary! Take it a span at a time
  • 55. Switching our web app to traces
      • 1 week hackathon for ~2 people to get the big plumbing in
      • 3 months of 2-3 engineers adding context in our spare time (yikes)
      • Ongoing instrumentation improvements as we need them, an hour or two per month
  • 58. Storytime: the stampede
        // SaveToCache caches the dataset by ID
        func (d *Dataset) SaveToCache(ctx context.Context, cacheService cache.Service, mysqlService mysql.Service) {
            cacheKey := datasetCacheKey(d.ID)
            cacheService.SetWithTTL(ctx, cacheKey, d, datasetCacheTTL,
                func(innerCtx context.Context, key string, value interface{}) (interface{}, error) {
                    return FindDatasetByID(innerCtx, mysqlService, d.ID)
                })
        }
  • 62. Tracing
      • How: start at your hot spots
      • Why: dive deeper
      • What you get: faster perf & distributed systems debugging
      • When you’re ready for the next step: lol no, you’ll be working on tracing instrumentation forever
  • 63. How not to shoot your own foot
  • 64. Sampling
      • Tracing sends much more data: don’t surprise your CFO with a double-digit percent increase in your AWS bill like I did!
      • Select & send representative events, not all events.
      • Use the dynsampler library or write your own logic
      • Be sure to set the `sample_rate` field on your events
      Canopy paper: https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
      Ben’s Sampling talk: https://www.usenix.org/conference/lisa17/conference-program/presentation/hartshorne
  • 65. Schema churn
      The best schema:
      • Is consistent between services
      • Plans ahead for field names (follow OpenCensus field conventions?)
      • Uses namespacing!
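One possible way to keep namespacing consistent is to route every field through a small shared helper rather than typing prefixes by hand. The sketch below assumes a dot-separated convention like the `user.id` field in the earlier handler example; the helper name `withNamespace` is hypothetical.

```go
package main

import "fmt"

// withNamespace returns a copy of fields with every key prefixed by "ns.".
// Sharing one helper like this across services is one way to keep names aligned.
// (Hypothetical helper for illustration; not from the talk.)
func withNamespace(ns string, fields map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(fields))
	for k, v := range fields {
		out[ns+"."+k] = v
	}
	return out
}

func main() {
	user := withNamespace("user", map[string]interface{}{
		"id":    18,
		"email": "nathan@honeycomb.io",
	})
	fmt.Println(user["user.id"], user["user.email"])
}
```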
  • 67. Start in !prod
      Use events or traces to debug something low-risk and high-value, like:
      • build pipeline: find bottlenecks
      • test suite: quantify your test flakiness
  • 68. Build pipeline: find bottlenecks