Patchwork Data at Etsy
Matt Walker
Etsy
[Timeline, 2005 to 2013: marker at June]
What happened?
We don’t like to talk about it
Okay, we do

•   http://codeascraft.etsy.com

•   https://www.etsy.com/codeascraft/talks



•   http://kongscreenprinting.com
Catch Phrases

•   Continuous deployment

•   Blameless postmortems

•   Measure everything

•   Continuous experimentation
Metrics-Driven Development


•   Ganglia

•   StatsD/Graphite

•   Splunk
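As an illustration of "measure everything" with StatsD, here is a minimal sketch of emitting a counter and a timer over UDP. It assumes a StatsD daemon listening on its conventional port 8125; the metric names are hypothetical and do not come from the talk.

```python
import socket


def format_metric(name: str, value, metric_type: str, sample_rate: float = 1.0) -> str:
    """Build one StatsD datagram payload, e.g. 'logins:1|c'.

    metric_type is 'c' for a counter, 'ms' for a timer, 'g' for a gauge.
    """
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        # Tell the daemon that only a fraction of events is being reported.
        line += f"|@{sample_rate}"
    return line


def send_metric(payload: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one datagram; UDP is fire-and-forget, so no reply is awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))
```

A call such as `send_metric(format_metric("checkout.completed", 1, "c"))` increments a counter, and Graphite can then chart the aggregated series.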
Scaling a Traditional RDBMS


•   Sharded MySQL

•   memcached

•   Object-relational mapping in PHP
[Timeline, 2005 to 2013: marker at December]
Adtuitive


•   Online advertising network

•   Match forum posts with rich product advertisements

•   Unafraid of scaling across Etsy sellers
Adtuitive


•   Amazon Web Services

•   JRuby

•   Rails
LAMP Stack for Big Data
•   HDFS                               •   Pig

•   MapReduce                          •   Oozie

•   HBase                              •   Avro

•   Hive                               •   Zookeeper

•   Flume

•   JDBC/ODBC

•   Hue

http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/
LAMP Stack for Big Data
•   HDFS → S3               •   Pig → Cascading

•   MapReduce (Elastic)     •   Oozie

•   HBase                   •   Avro → TupleSerialization

•   Hive                    •   Zookeeper

•   Flume

•   JDBC/ODBC

•   Hue
Powered by MapReduce

•   ETL

•   Analytics

•   A/B testing

•   Recommenders

•   Search
Applications
•   Log ETL                          •   A/B Analyzer

•   Database snapshotter             •   Catapult

•   TasteTest                        •   Distributed search indexing

•   Facebook Gift Recommender        •   Fast Game (search index)

•   Complementary/similar listings   •   Search autosuggest

•   Funnel Cake                      •   SearchAds

•   Feature Funnel                   •   SCRAM ETL (fraud detection)
Catapult


•   End-to-end success story

•   Extremely valuable for a web shop
Relevancy Thursdays
[Timeline, 2005 to 2013: marker at January]
Relevancy Thursdays


•   Switch default sort order to relevance

•   Each Thursday in January
Relevancy Thursdays


•   Default search order was recency

•   Relisting was our equivalent of advertising

•   $0.20 updated your listing’s timestamp
Relevancy Thursdays


•   Recency was meant to support “freshness” in search results

•   Search originated as PostgreSQL query

•   Converted to Solr to scale
What happens if we switch to relevance?
Relevancy Thursdays


•   No A/B testing framework

•   No event logs

•   Limping along with Google Analytics
First Log Analysis
[Timeline, 2005 to 2013: marker at February]
First Log Analysis


•   Raw web access logs

•   URL- and ref tag-based

•   Regex parser
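A regex parser over raw access logs can be sketched as below. This is only an illustration: it assumes the Apache combined log format and a `ref` query parameter carrying the ref tag, and the sample values in the usage note are made up. The talk does not show Etsy's actual parser.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumed layout: Apache combined log format.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line: str):
    """Return path, ref tag and status for one log line, or None if malformed."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    parts = urlsplit(match.group("url"))
    # The ref tag is assumed to travel as a ?ref=... query parameter.
    refs = parse_qs(parts.query).get("ref", [])
    return {
        "path": parts.path,
        "ref": refs[0] if refs else None,
        "status": int(match.group("status")),
    }
```

Given a request for `/listing/12345?ref=sr_gallery_3`, the parser yields the path and the ref tag separately, which is enough to count visits per page and per referring placement.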
Heyday of Tooling
•   A/B framework

•   Front end event logger

•   Database snapshotter

•   Barnum and Bailey

•   Custom operator library

•   Loaders
LAMP Stack for Big Data
•   HDFS → S3                       •   Pig → Cascading

•   MapReduce (Elastic)             •   Oozie → Barnum

•   HBase                           •   Avro → TupleSerialization

•   Hive                            •   Zookeeper

•   Flume → Akamai

•   JDBC/ODBC → snapshotter/loaders

•   Hue
A/B Framework


•   Ramp-ups + A/B testing

•   Feature flag development
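One common way to combine ramp-ups with A/B testing behind a feature flag is deterministic hashing of the user into a bucket. The sketch below is an assumption about how such a framework could work, not Etsy's implementation; the function names and the experiment name are hypothetical.

```python
import hashlib


def bucket(experiment: str, user_id: str) -> float:
    """Map a user to a stable position in [0, 100) for this experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10000 / 100.0


def is_enabled(experiment: str, user_id: str, ramp_percent: float) -> bool:
    """Ramp-up: raising ramp_percent only ever adds users, never removes them."""
    return bucket(experiment, user_id) < ramp_percent


def assign(experiment: str, user_id: str, ramp_percent: float) -> str:
    """Return 'off' outside the ramp, else a stable control/treatment split."""
    if not is_enabled(experiment, user_id, ramp_percent):
        return "off"
    # A second, independent hash keeps the split unrelated to ramp position.
    digest = hashlib.sha256(f"{experiment}:variant:{user_id}".encode("utf-8")).digest()
    return "treatment" if digest[0] % 2 else "control"
```

Because assignment depends only on the experiment name and the user id, the same user sees the same variant on every request, and the logged variant can later be joined to event data for analysis.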
Self-service analytics for any A/B test on the site
A/B Framework
[Timeline, 2005 to 2013: marker at June]
A/B Analyzer
[Timeline, 2005 to 2013: marker at November]
Why did it take so long?


•   Non-web developers learning the PHP stack

•   Failed experiments with “easier to use” MapReduce tools

•   Realizing self-service analytics was what Etsy needed
Catapult
[Timeline, 2005 to 2013: marker at February]
Catapult


•   A/B Analyzer + Launch Calendar

•   Full product lifecycle
LAMP Stack for Big Data
•   HDFS                            •   Pig → Cascading

•   MapReduce                       •   Oozie

•   HBase                           •   Avro → TupleSerialization

•   Hive → Vertica                  •   Zookeeper

•   Flume → logrotate

•   JDBC/ODBC → snapshotter/loaders

•   Hue
Computation Models


•   Batch

•   Interactive

•   Streaming
Batch
Cascading
RDBMS / Cascading
RDBMS                       Cascading
SQL                         cascading.jruby
Query Planner/Optimizer     Cascading
Execution Engine            MapReduce
Storage                     HDFS
cascading.jruby

•   Productivity: no compile

•   Reuse: factor out structure

•   Efficiency: no JRuby runtime

•   Optimization: move aggregations map-side
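The "move aggregations map-side" point can be illustrated outside of Cascading with a small pure-Python simulation. It is a sketch of the general technique (partial aggregation before the shuffle), not of cascading.jruby's API; all function names are hypothetical.

```python
from collections import Counter


def map_naive(split):
    """Emit one (key, 1) pair per record; every pair crosses the shuffle."""
    return [(key, 1) for key in split]


def map_with_partial_aggregation(split):
    """Pre-sum counts inside the mapper, so one pair per distinct key is emitted."""
    return list(Counter(split).items())


def reduce_counts(pairs):
    """Sum the (possibly partial) counts per key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run(splits, mapper):
    """Return the final totals and the number of pairs sent through the shuffle."""
    shuffled = [pair for split in splits for pair in mapper(split)]
    return reduce_counts(shuffled), len(shuffled)
```

Both mappers produce the same totals; the map-side variant simply ships fewer pairs to the reducers, which is where the saving comes from on skewed keys.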
A nice constructor
cascading.jruby
Productivity

•   Job templates

•   Reloader

•   Cascading local mode

•   Sampled data
Reuse
Field Names
Efficiency


•   Just a constructor

•   Calls into Cascading API

•   No JRuby runtime on cluster
Optimization
Tuple Data Model
UDFs
Scalding


•   Distributed collections

•   Function literals replace UDFs
Interactive
Vertica
Sharded MySQL


•   Borrowed from Flickr

•   Works
Thou Shalt Not Join
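One reading of "Thou Shalt Not Join" is that related rows can live on different shards, so the join has to happen in application code instead of SQL. The sketch below uses in-memory dictionaries in place of MySQL shards and modulo placement in place of a real shard lookup; both are simplifying assumptions, and the table and function names are hypothetical.

```python
from collections import defaultdict

NUM_SHARDS = 2


def shard_for(owner_id: int) -> int:
    """Assumed placement rule; a real system may use a lookup table instead."""
    return owner_id % NUM_SHARDS


# Each shard holds the rows owned by the users and sellers placed on it.
SHARDS = [
    {   # shard 0: user 2 and seller 4
        "favorites": {2: [(101, 1), (102, 4)]},  # user_id -> [(listing_id, seller_id)]
        "listings": {102: "Ceramic mug"},        # listing_id -> title
    },
    {   # shard 1: seller 1
        "favorites": {},
        "listings": {101: "Wool scarf"},
    },
]


def favorite_titles(user_id: int):
    """Application-side join: one read on the user's shard, then one per seller shard."""
    favorites = SHARDS[shard_for(user_id)]["favorites"].get(user_id, [])
    ids_by_shard = defaultdict(list)
    for listing_id, seller_id in favorites:
        ids_by_shard[shard_for(seller_id)].append(listing_id)
    titles = {}
    for shard_id, listing_ids in ids_by_shard.items():
        table = SHARDS[shard_id]["listings"]
        for listing_id in listing_ids:
            titles[listing_id] = table[listing_id]
    return [titles[listing_id] for listing_id, _ in favorites]
```

This works for serving pages, but ad hoc analytical joins become tedious, which is the gap an offline copy in an analytical database can fill.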
Hive
[Timeline, 2005 to 2013: marker at January]
Hive Turned Off
[Timeline, 2005 to 2013: marker at April]
Hive

•   Slow

•   Sensitive

•   Operational burden

•   Educational burden
Vertica


•   Offline copy of shards, master, auxiliary databases

•   Joins are easy

•   Reasonable latency
Vertica
[Timeline, 2005 to 2013: marker at November]
Vertica


•   Game changer at Etsy

•   High demand for joins

•   Rapid prototyping data pipelines
RDBMS / Cascading
RDBMS                       Cascading
SQL                         cascading.jruby
Query Planner/Optimizer     Cascading
Execution Engine            MapReduce
Storage                     HDFS
Back to MapReduce

•   Event logs

•   Schedule

•   Load data in prod

•   Scale
Vertica


•   Not Hive, Impala, Shark, etc.

•   May change our minds
Streaming
Not Powered by MapReduce


•   Activity Feed

•   Shop Stats
Etsyweb


•   memcached

•   Gearman

•   Sharded MySQL
Use Cases


•   Trending

•   Fraud detection

•   ?
Turns out people don’t make product decisions in real time

http://mcfunley.com/whom-the-gods-would-destroy-they-first-give-real-time-analytics
Summing Up


•   Be glad you’re living in the future

•   Automated tools for the common case

•   Don’t be afraid to experiment
Image Credits
•   http://kongscreenprinting.com/what-we-do-showcase
•   http://animal.discovery.com
•   http://www.rallyrace.com/turning-over-the-stone-event-production-basics/
•   http://www.flickr.com/photos/bbalaji/2443820505/
•   http://www.madeyoulaugh.com/funny_photos/caveman_harley/caveman_harley.jpg
•   http://theundercoverrecruiter.com/6-ways-catapult-your-job-search-after-layoff/
•   http://www.globaltimes.cn/SPECIALCOVERAGE/Top10Peopleof2011.aspx
•   http://www.theculturemap.com/scream-time-edvard-munch-museum/
•   http://www.repentamerica.com/webelieve.html
•   https://soundcloud.com/tearland/tl-hive
•   http://pocketnow.com/2012/08/02/wifi-vs-data-speed-vs-battery-life/bush-scratching-head
Contact / Reference

•   Matt Walker

•   @data_daddy

•   http://codeascraft.etsy.com/

•   http://www.etsy.com/codeascraft/talks
