SlideShare une entreprise Scribd logo
1  sur  31
Télécharger pour lire hors ligne
Databus




1/29/2013   Recruiting Solutions
            `                        Databus   1
INTRODUCTION


  `            2
LinkedIn by Numbers
 World’s largest professional network
 187M+ members world-wide as of Q3 2012
   Growing at the rate of two per second
 85 of Fortune 100 companies use Talent Solutions
  to hire
 > 2.6M company pages
 > 4B search queries
 75K+ developers leveraging out APIs
 1.3M unique publishers


     `                    Databus                3
The Consequence of Specialization in
           Data Systems
Data Flow is essential
Data Consistency is critical !!!




        `
Solution: Databus



              Standardi
               Standardi    Standardi
                             Standardi   Standardi
                                          Standardi   Standardi
                                                       Standardi
                Standardi       Search       Graph         Read
  Updates




                zation
                 zation       zation
                               zation      zation
                                            zation      zation
                                                         zation
                  zation         Index       Index       Replicas




Primary
   DB                       Data Change Events

                               Databus

  `                                                                 5
Two Ways


    Application code dual    Extract changes from
    writes to database and   database commit log
    pub-sub system




    Easy on the surface      Tough but possible

    Consistent?              Consistent!!!



`
Key Design Decisions : Semantics
• Logical clocks attached to the source
  – Physical offsets could be used for internal
    transport
  – Simplifies data portability
• Pull model
  – Restarts are simple
  – Derived State = f (Source state, Clock)
  – + Idempotence = Timeline Consistent!

     `                                            7
Key Design Decisions : Systems
• Isolate fast consumers from slow consumers
  – Workload separation between online, catch-up,
    bootstrap
• Isolate sources from consumers
  – Schema changes
  – Physical layout changes
  – Speed mismatch
• Schema-aware
  – Filtering, Projections
  – Typically network-bound  can burn more CPU

     `                                              8
Requirements
•   Timeline consistency
•   Guaranteed, at least once delivery
•   Low latency
•   Schema evolution
•   Source independence
•   Scalable consumers
•   Handle for slow/new consumers without
    affecting happy ones (look-back requirements)

      `                                         9
ARCHITECTURE


  `            10
0
                          Initial Design (2007)                                   Happy
                                                                                 Consumer
         Source clock
            timer




                                                                                       …
             SCN

                                 Direct Pull              Relay                    Happy
                            DB                           In Memory                Consumer
 70000                                                     Buffer
                                  Proxied
             3 hrs
                                    Pull
100000       Relay
102400                                                                             Slow
                     DB
                                                                                 Consumer




Pros:
                                                  Cons:
1. Consumer Scaling
                                                  Slow consumers overwhelming the DB
2. Some isolation


                `                              Databus                                 11
Software Architecture
                 Four Logical Components

                  • Fetcher
                      – Fetch from db, relay…
                  • Log Store
                      – Store log snippet
                  • Snapshot Store
                      – Store moving data
                        snapshot
                  • Subscription Client
                      – Orchestrate pull
                        across these


`
0
        Source clock
           timer
            SCN
                                  The Databus System                                    Happy
                       Snapshot                                                        Consumer




                                                                                         …
                                  infinite
30000                  Log
                                                           Relay                        Happy
                       10 days                        In Memory                        Consumer
 70000   Relay
                                                        Buffer
 80000
 90000     3 hrs
100000
102400                                                                                   Slow
                       DB
                                                                                       Consumer


                                                                   Server


                                             Log Storage              Snapshot Store


                                                                   Bootstrap Service

                   `                                                                       13
The Relay
•   Change event buffering (~ 2 – 7 days)
•   Low latency (10-15 ms)
•   Filtering, Projection
•   Hundreds of consumers per relay
•   Scale-out, High-availability through
    redundancy



       `
Deployment Options




Option 1: Peered Deployment   Option 2: Clustered Deployment

     `
The Bootstrap Service
•   Catch-all for slow / new consumers
•   Isolate source OLTP instance from large scans
•   Log Store + Snapshot Store
•   Optimizations
    – Periodic merge
    – Predicate push-down
    – Catch-up versus full bootstrap
• Guaranteed progress for consumers via chunking
• Implementations
    – Database (MySQL)
    – Raw Files
• Bridges the continuum between stream and batch systems

       `
The Consumer Client Library
• Glue between Databus infra and business logic
  in the consumer
• Isolates the consumer from changes in the
  databus layer.
• Switches between relay and bootstrap as
  needed
• API
  – Callback with transactions
  – Iterators over windows

    `
Fetcher Implementations
• Oracle
  – Trigger-based
• MySQL
  – Custom-storage-engine based
• In Labs
  – Alternative implementations for Oracle
  – OpenReplicator integration for MySQL


     `
Meta-data Management
• Event definition, serialization and transport
  – Avro
• Oracle, MySQL
  – Avro definition generated from the table schema
• Schema evolution
  – Only backwards-compatible changes allowed
• Isolation between upgrades on producer and
  consumer

     `
Scaling the consumers
                (Partitioning)
• Server-side filtering
  – Range, mod, hash
  – Allows client to control partitioning function
• Consumer groups
  – Distribute partitions evenly across a group
  – Move partitions to available consumers on failure
  – Minimize re-processing



     `
A NEW CONSUMER


 `               21
Development with Databus – Client
                   Library
    Databus Client

                     Consumers
                      Consumers

                          implement



          Stream Event      Bootstrap Event
             Callback          Callback                        Client API
               API                API

                                   Databus Client Library

onDataEvent(DbusEvent, Decoder)                  register(consumers, sources , filter)
…                                                start() ,
…
                                                 shutdown(),

           `                           Databus                                      22
Databus Consumer Implementation
class MyConsumer
      extends AbstractDatabusStreamConsumer
{
   ConsumerCallbackResult onDataEvent(DbusEvent e,
                                       DbusEventDecoder d){
    //use map-like Avro GenericRecord
    GenericRecord g = d.getGenericRecord(e, null);
    //or use the auto-generated Java class
    MyEvent e = d.getTypedValue(e, null,
                                            MyEvent.class);
    …
    return ConsumerCallbackResult.SUCCESS;
  }
}

     `                     Databus                       23
Starting the client
public void main(String[]) {
  //configure
  DatabusHttpClientImpl.Config clientConfig =
                          new DatabusHttpClientImpl.Config();
  clientConfig.loadFromFile(“mydbus”, “mdbus.props”);
  DatabusHttpClientImpl client =
               new DatabusHttpClientImpl(clientConfig);
  //register callback
  MyConsumer callback = new MyConsumer();
  client.registerDatabusStreamListener(callback,
          null, "com.linkedin.events.member2.MemberProfile”);
  //start client library
  client.startAndBlock();
}

        `                    Databus                        24
Event Callback APIs
•




    `           Databus       25
PERFORMANCE


 `            26
Relay Throughput




`         Databus      27
Consumer Throughput




`           Databus       28
End-End Latency




`         Databus     29
Snapshot vs Catchup




`           Databus       30
Recruiting Solutions   31

Contenu connexe

Tendances

Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamDataWorks Summit
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterMat Keep
 
Characteristics of no sql databases
Characteristics of no sql databasesCharacteristics of no sql databases
Characteristics of no sql databasesDipti Borkar
 
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scaleHow LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scaleLinkedIn
 
Couchbase and Apache Spark
Couchbase and Apache SparkCouchbase and Apache Spark
Couchbase and Apache SparkMatt Ingenthron
 
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)Kanak Biscuitwala
 
Aceleracion de aplicacione 2
Aceleracion de aplicacione 2Aceleracion de aplicacione 2
Aceleracion de aplicacione 2jfth
 
In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! Uri Cohen
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn confluent
 
Improvements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchImprovements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchDataWorks Summit/Hadoop Summit
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0EDB
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservicesBigstep
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Spark Summit
 
Building Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache GeodeBuilding Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache GeodePivotalOpenSourceHub
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
 

Tendances (20)

Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL Cluster
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Characteristics of no sql databases
Characteristics of no sql databasesCharacteristics of no sql databases
Characteristics of no sql databases
 
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scaleHow LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
 
Couchbase and Apache Spark
Couchbase and Apache SparkCouchbase and Apache Spark
Couchbase and Apache Spark
 
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
 
Aceleracion de aplicacione 2
Aceleracion de aplicacione 2Aceleracion de aplicacione 2
Aceleracion de aplicacione 2
 
In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified!
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
Improvements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchImprovements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba Search
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
 
Building Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache GeodeBuilding Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache Geode
 
Azure and cloud design patterns
Azure and cloud design patternsAzure and cloud design patterns
Azure and cloud design patterns
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 

Similaire à Introduction to Databus

Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...SL Corporation
 
SQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data WarehouseSQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data WarehouseMark Ginnebaugh
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...SL Corporation
 
Lug best practice_hpc_workflow
Lug best practice_hpc_workflowLug best practice_hpc_workflow
Lug best practice_hpc_workflowrjmurphyslideshare
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedInAmy W. Tang
 
How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
How to Build a SaaS App With Twitter-like Throughput on Just 9 ServersHow to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
How to Build a SaaS App With Twitter-like Throughput on Just 9 ServersNew Relic
 
Times Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo LudasTimes Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo LudasORACLE USER GROUP ESTONIA
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inliqiang xu
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure bloomreacheng
 
Lync 2010 High Availability
Lync 2010 High AvailabilityLync 2010 High Availability
Lync 2010 High AvailabilityHarold Wong
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisScaleOut Software
 
Databus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineDatabus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineSunil Nagaraj
 
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012Michael Noel
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
Zeroth review presentation - eBay Turmeric / SMC
Zeroth review presentation - eBay Turmeric / SMCZeroth review presentation - eBay Turmeric / SMC
Zeroth review presentation - eBay Turmeric / SMCArvind Krishnaa
 
Top 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridTop 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridScaleOut Software
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scalexcbsmith
 
Oracle in the Cloud
Oracle in the CloudOracle in the Cloud
Oracle in the Cloudzain1425
 

Similaire à Introduction to Databus (20)

Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
 
SQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data WarehouseSQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data Warehouse
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
 
Lug best practice_hpc_workflow
Lug best practice_hpc_workflowLug best practice_hpc_workflow
Lug best practice_hpc_workflow
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
How to Build a SaaS App With Twitter-like Throughput on Just 9 ServersHow to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
 
Times Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo LudasTimes Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo Ludas
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_in
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure
 
Lync 2010 High Availability
Lync 2010 High AvailabilityLync 2010 High Availability
Lync 2010 High Availability
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data Analysis
 
Databus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineDatabus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture Pipeline
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
Zeroth review presentation - eBay Turmeric / SMC
Zeroth review presentation - eBay Turmeric / SMCZeroth review presentation - eBay Turmeric / SMC
Zeroth review presentation - eBay Turmeric / SMC
 
Top 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridTop 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data Grid
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scale
 
Oracle in the Cloud
Oracle in the CloudOracle in the Cloud
Oracle in the Cloud
 

Plus de Amy W. Tang

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedInAmy W. Tang
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationAmy W. Tang
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInAmy W. Tang
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Amy W. Tang
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using HelixAmy W. Tang
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph PresentationAmy W. Tang
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Amy W. Tang
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State DrivesAmy W. Tang
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with HelixAmy W. Tang
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInAmy W. Tang
 

Plus de Amy W. Tang (11)

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data Application
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using Helix
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph Presentation
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State Drives
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with Helix
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
 

Dernier

Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 

Dernier (20)

Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 

Introduction to Databus

  • 1. Databus 1/29/2013 Recruiting Solutions ` Databus 1
  • 3. LinkedIn by Numbers  World’s largest professional network  187M+ members world-wide as of Q3 2012  Growing at the rate of two per second  85 of Fortune 100 companies use Talent Solutions to hire  > 2.6M company pages  > 4B search queries  75K+ developers leveraging out APIs  1.3M unique publishers ` Databus 3
  • 4. The Consequence of Specialization in Data Systems Data Flow is essential Data Consistency is critical !!! `
  • 5. Solution: Databus Standardi Standardi Standardi Standardi Standardi Standardi Standardi Standardi Standardi Search Graph Read Updates zation zation zation zation zation zation zation zation zation Index Index Replicas Primary DB Data Change Events Databus ` 5
  • 6. Two Ways Application code dual Extract changes from writes to database and database commit log pub-sub system Easy on the surface Tough but possible Consistent? Consistent!!! `
  • 7. Key Design Decisions : Semantics • Logical clocks attached to the source – Physical offsets could be used for internal transport – Simplifies data portability • Pull model – Restarts are simple – Derived State = f (Source state, Clock) – + Idempotence = Timeline Consistent! ` 7
  • 8. Key Design Decisions : Systems • Isolate fast consumers from slow consumers – Workload separation between online, catch-up, bootstrap • Isolate sources from consumers – Schema changes – Physical layout changes – Speed mismatch • Schema-aware – Filtering, Projections – Typically network-bound  can burn more CPU ` 8
  • 9. Requirements • Timeline consistency • Guaranteed, at least once delivery • Low latency • Schema evolution • Source independence • Scalable consumers • Handle for slow/new consumers without affecting happy ones (look-back requirements) ` 9
  • 11. 0 Initial Design (2007) Happy Consumer Source clock timer … SCN Direct Pull Relay Happy DB In Memory Consumer 70000 Buffer Proxied 3 hrs Pull 100000 Relay 102400 Slow DB Consumer Pros: Cons: 1. Consumer Scaling Slow consumers overwhelming the DB 2. Some isolation ` Databus 11
  • 12. Software Architecture Four Logical Components • Fetcher – Fetch from db, relay… • Log Store – Store log snippet • Snapshot Store – Store moving data snapshot • Subscription Client – Orchestrate pull across these `
  • 13. 0 Source clock timer SCN The Databus System Happy Snapshot Consumer … infinite 30000 Log Relay Happy 10 days In Memory Consumer 70000 Relay Buffer 80000 90000 3 hrs 100000 102400 Slow DB Consumer Server Log Storage Snapshot Store Bootstrap Service ` 13
  • 14. The Relay • Change event buffering (~ 2 – 7 days) • Low latency (10-15 ms) • Filtering, Projection • Hundreds of consumers per relay • Scale-out, High-availability through redundancy `
  • 15. Deployment Options Option 1: Peered Deployment Option 2: Clustered Deployment `
  • 16. The Bootstrap Service • Catch-all for slow / new consumers • Isolate source OLTP instance from large scans • Log Store + Snapshot Store • Optimizations – Periodic merge – Predicate push-down – Catch-up versus full bootstrap • Guaranteed progress for consumers via chunking • Implementations – Database (MySQL) – Raw Files • Bridges the continuum between stream and batch systems `
  • 17. The Consumer Client Library • Glue between Databus infra and business logic in the consumer • Isolates the consumer from changes in the databus layer. • Switches between relay and bootstrap as needed • API – Callback with transactions – Iterators over windows `
  • 18. Fetcher Implementations • Oracle – Trigger-based • MySQL – Custom-storage-engine based • In Labs – Alternative implementations for Oracle – OpenReplicator integration for MySQL `
  • 19. Meta-data Management • Event definition, serialization and transport – Avro • Oracle, MySQL – Avro definition generated from the table schema • Schema evolution – Only backwards-compatible changes allowed • Isolation between upgrades on producer and consumer `
  • 20. Scaling the consumers (Partitioning) • Server-side filtering – Range, mod, hash – Allows client to control partitioning function • Consumer groups – Distribute partitions evenly across a group – Move partitions to available consumers on failure – Minimize re-processing `
  • 22. Development with Databus – Client Library Databus Client Consumers Consumers implement Stream Event Bootstrap Event Callback Callback Client API API API Databus Client Library onDataEvent(DbusEvent, Decoder) register(consumers, sources , filter) … start() , … shutdown(), ` Databus 22
  • 23. Databus Consumer Implementation class MyConsumer extends AbstractDatabusStreamConsumer { ConsumerCallbackResult onDataEvent(DbusEvent e, DbusEventDecoder d){ //use map-like Avro GenericRecord GenericRecord g = d.getGenericRecord(e, null); //or use the auto-generated Java class MyEvent e = d.getTypedValue(e, null, MyEvent.class); … return ConsumerCallbackResult.SUCCESS; } } ` Databus 23
  • 24. Starting the client public void main(String[]) { //configure DatabusHttpClientImpl.Config clientConfig = new DatabusHttpClientImpl.Config(); clientConfig.loadFromFile(“mydbus”, “mdbus.props”); DatabusHttpClientImpl client = new DatabusHttpClientImpl(clientConfig); //register callback MyConsumer callback = new MyConsumer(); client.registerDatabusStreamListener(callback, null, "com.linkedin.events.member2.MemberProfile”); //start client library client.startAndBlock(); } ` Databus 24
  • 25. Event Callback APIs • ` Databus 25
  • 27. Relay Throughput ` Databus 27
  • 28. Consumer Throughput ` Databus 28
  • 29. End-End Latency ` Databus 29
  • 30. Snapshot vs Catchup ` Databus 30