SlideShare une entreprise Scribd logo
1  sur  51
PB vs. Thrift vs. Avro



    Author: Igor Anishchenko

                     Lohika - May, 2012
Problem Statement
      Simple Distributed Architecture

                serialize      deserialize



                deserialize      serialize




  •    Basic questions are:

       •   What kind of protocol to use, and what data to transmit?

       •   Efficient mechanism for storing and exchanging data

       •   What to do with requests on the server side?
…and you want to scale your servers...


 •   When you grow beyond a simple architecture, you want..
     •   flexibility
     •   ability to grow
     •   latency
     •   and of course - you want it to be simple
How components talk

 •   Database protocols - fine.
 •   HTTP + maybe JSON/XML on the front - cool.
How components talk

 •   Database protocols - fine.
 •   HTTP + maybe JSON/XML on the front - cool.

 • But most of the times you have
     internal APIs.
Hasn't this been done before? (yes)


  •   SOAP
  •   CORBA
  •   DCOM, COM+
  •   JSON, Plain Text, XML
Should we pick up one of those? (no)
 •   SOAP
     •   XML, XML and more XML. Do we really need to parse so much XML?
 •   CORBA
     •   Amazing idea, horrible execution
     •   Overdesigned and heavyweight
 •   DCOM, COM+
     •   Embraced mainly in windows client software
 •   HTTP/JSON/XML/Whatever
     •   Okay, proven – hurray!
     •   But lack protocol description.
     •   You have to maintain both client and server code.
     •   You still have to write your own wrapper to the protocol.
     •   XML has high parsing overhead.
     •   (relatively) expensive to process; large due to repeated tags
Decision Time?

 As a developer - what are you looking for?

                   Be patient, I have something for you
                   on the subsequent slides!!
High level goals!

 •   Transparent interaction between multiple programming
     languages

     •   A language and platform neutral way of serializing
         structured data for use in communications protocols,
         data storage etc.
High level goals!

 •   Transparent interaction between multiple programming
     languages

     •   A language and platform neutral way of serializing
         structured data for use in communications
         protocols, data storage etc.

 •   Maintain Right balance between:
     •   Efficiency (how much time/space?)

     •   Ease and speed of development

     •   Availability of existing libraries and etc..
Consideration: Protocol Space


             {"deposit_money": "12345678"}
             JSON                        Binary
  '0x6d', '0x6f', '0x6e',   '0x01', '0xBC614E'
  '0x65', '0x79', '0x31',
  '0x32', '0x33', '0x34',
  '0x35', '0x36', '0x37',
  '0x38'




  Binary takes less space. No contest!
Consideration: Protocol Time


            JSON                      Binary
  Push down automata         No parser needed. The
  (PDA) parser (LL(1),       binary representation IS
  LR(1)) -- 1 character      [as close as to] the
  lookahead. Then, final     machine representation.
  translation from
  characters to native
  types (int, float, etc)



  Binary is way faster. No contest
Consideration: Protocol Ease of Use

            JSON                          Binary
 Brainless to learn            Need to manually write
 Popular                       code to define message
                               packets (total pain and
                               error prone!!!)

                                            or

                               Use a code generator like
                               Thrift (oh noes, I don't want
                               to learn something new!)

  Json is easier, binary is a pain.
Several smart people have attacked this problem over the
years and as a result there several good open source
alternatives to choose from




                      Here is where
         Data Interchange Protocols
                    comes in play…
Serialization Frameworks

                  XML, JSON,

          Protocol Buffers, BERT,

     BSON, Apache    Thrift, Message Pack,

       Etch, Hessian, ICE, Apache   Avro,
               Custom Protocol...
SF have some properties in common


 •   Interface Description (IDL)
 •   Performance
 •   Versioning
 •   Binary Format
Protocol Buffer

•   Designed ~2001 because everything else wasn’t that good those days

•   Production, proprietary in Google from 2001-2008, open-sourced since 2008

•   Battle tested, very stable, well trusted

•   Every time you hit a Google page, you're hitting several services and several PB
code

•   PB is the glue to all Google services

•   Official support for four languages: C++, Java, Python, and JavaScript

•   Does have a lot of third-party support for other languages (of highly variable quality)

•   Current Version - protobuf-2.4.1

•   BSD License
Apache Thrift

•   Designed by an X-Googler in 2007

•   Developed internally at Facebook, used extensively there

•   An open Apache project, hosted in Apache's Inkubator.

•   Aims to be the next-generation PB (e.g. more comprehensive features, more
languages)

•   IDL syntax is slightly cleaner than PB. If you know one, then you know the other

•   Supports: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages

•   Offers a stack for RPC calls

•   Current Version - thrift-0.8.0

•   Apache License 2.0
Avro

 •   I have a lot to say about Avro towards the end
Typical Operation Model

 •   The typical model of Thrift/Protobuf use is

     •   Write down a bunch of struct-like message formats in an IDL-
         like language.

     •   Run a tool to generate Java/C++/whatever boilerplate code.

         •   Example: thrift --gen java MyProject.thrift

     •   Outputs thousands of lines - but they remain fairly readable in
         most languages

     •   Link against this boilerplate when you build your application.

     •   DO NOT EDIT!
Thrift Principle of Operation
Interface Definition Language (IDL)

 •   Web services interfaces are described using the Web Service
     Definition Language. Like SOAP, WSDL is a XML-based
     language.

 •   The new frameworks use their own languages, that are not based
     on XML.

 •   These new languages are very similar to the Interface Definition
     Language, known from CORBA.
Thrift                                    Protobuf
namespace java serializers.thrift.media   package serializers.protobuf.media;

typedef i32 int                           option java_package = "serializers.protobuf.media";
typedef i64 long                          option java_outer_classname = "MediaContentHolder";
                                          option optimize_for = SPEED; affects the C++ and Java
enum Size {                               code generators
  SMALL = 0,
  LARGE = 1,                              message Image {
}                                           required string uri = 1; //url to the thumbnail
enum Player {                               optional string title = 2; //used in the html
  JAVA = 0,                                 required int32 width = 3; // of the image
  FLASH = 1,                                required int32 height = 4; // of the image
}                                           enum Size {
                                              SMALL = 0;
struct Image {                                LARGE = 1;
  1: string uri, //url to the images        }
  2: optional string title,                 required Size size = 5;
  3: required int width,                  }
  4: required int height,
  5: required Size size,                  message Media {
}                                           required string uri = 1;
                                            optional string title = 2;
struct Media {                              required int32 width = 3;
  1: string uri, //url to the thumbnail     required int32 height = 4;
  2: optional string title,                 repeated string person = 5;
  3: required int width,                    enum Player {
  4: required int height,                     JAVA = 0;
  5: required list<string> person,            FLASH = 1;
  6: required Player player,                }
  7: optional string copyright,             required Player player = 6;
}                                           optional string copyright = 7;
                                            }
struct MediaContent {
  1: required list<Image> image,          message MediaContent {
  2: required Media media,                  repeated Image image = 1;
}                                           required Media media = 2;
                                          }
Defining IDL Rules

 •   Every field must have a unique, positive integer
     identifier ("= 1", " = 2" or " 1:", " 2:" )
 •   Fields may be marked as ’required’ or ’optional’
 •   structs/messages may contain other structs/messages
 •   You may specify an optional "default" value for a field
 •   Multiple structs/messages can be defined and referred
     to within the same .thrift/.proto file
Tagging

 •   The numbers are there for a reason!

 •   The "= 1", " = 2" or " 1:", " 2:" markers on each element identify
     the unique "tag" that field uses in the binary encoding.

 •   It is important that these tags do not change on either side

 •   Tags with values in the range 1 through 15 take one byte to
     encode

 •   Tags in the range 16 through 2047 take two bytes

 •   Reserve the tags 1 through 15 for very frequently occurring
     message elements
Java Example (Thrift example)
 // this file is BankDeposit.thrift
 struct BankDepositMsg {
     1: required i32 user_id;
     2: required double amount = 0.00;
     3: required i64 datestamp;}

 ...
 import bank_example.BankDepositMsg;
 ...
 BankDepositMsg my_transaction = new BankDepositMsg();
 my_transaction.setUser_id(123);
 my_transaction.setAmount(1000.00);
 my_transaction.setDatestamp(new Timestamp(date.getTime()));
 ...
 In Java (and other compiled languages) you have the getters and the setters, so that if
     the fields and types are erroneously changed the compiler will inform you of the
     mistake.
The Comparison…
                 Thrift                                              Protocol Buffers
Composite Type    Struct {}                                            Message {}
Base Types       bool                                                bool
                 byte                                                32/64-bit integers
                 16/32/64-bit integers                               float
                 double                                              double
                 string                                              string
                                                                     byte sequence
Containers       list<t1>: An ordered list of elements of type t1.   No
                 May contain duplicates.
                 set<t1>: An unordered set of unique elements of
                 type t1.
                 map<t1,t2>: A map of strictly unique keys of type
                 t1 to values of type t2.

Enumerations     Yes                                            Yes
Constants        Yes                                            No
                 Example:
                 const i32 INT_CONST = 1234;
                 const map<string,string> MAP_CONST = {"hello":
                 "world", "goodnight": "moon"}

Exception        Yes (exception keyword instead of the struct        No
Type/Handling    keyword.)
The Comparison

                            Thrift   Protocol Buffers
License                     Apache   BSD-style

Compiler                    C++      C++

RPC Interfaces              Yes      Yes

RPC Implementation          Yes      No (they do have one internally)

Composite Type Extensions   No       Yes

Data Versioning             Yes      Yes
Performance

 •   To keep things simple a lot is missing in the new frameworks.

 •   For example the extensibility of XML or the splitting of metadata
     (header) and payload (body).

 •   Of course the performance depends on the used operating
     system, programming language and the network.

 •   Size Comparison

 •   Runtime Performance
Size Comparison
Each write includes one Course object with 5 Person objects, and one Phone
object.
                                                               TBinaryProtocol – not optimized
                                                               for space efficiency. Faster to
                                                               process than the text protocol but
                                                               more difficult to debug.

                                                               TCompactProtocol – More
                                                               compact binary format; typically
                                                               more efficient to process as well




Method                              Size (smaller is better)
Thrift — TCompactProtocol           278 (not bad)
Thrift — TBinaryProtocol            460
Protocol Buffers                    250 (winner!)
RMI                                 905
REST — JSON                         559
REST — XML                          836
Runtime Performance

 •   Test Scenario

     •   Query the list of Course numbers.

     •   Fetch the course for each course number.

     •   This scenario is executed 10,000 times. The tests were run on the
         following systems:



     Operating System   Ubuntu®
     CPU                Intel® Core™ 2 T5500 @ 1.66 GHz
     Memory             2GiB
     Cores              2
Runtime Performance
Runtime Performance

                           Server CPU %   Avg. Client CPU %   Avg. Time

REST — XML                 12.00%         80.75%              05:27.45
REST — JSON                20.00%         75.00%              04:44.83
RMI                        16.00%         46.50%              02:14.54
Protocol Buffers           30.00%         37.75%              01:19.48
Thrift — TBinaryProtocol   33.00%         21.00%              01:13.65
Thrift — TCompactProtocol 30.00%          22.50%              01:05.12
Versioning


 •   The system must be able to support reading of old data, as well as
     requests from out-of-date clients to new servers, and vice versa.

 •   Versioning in Thrift and Protobuf is implemented via field identifiers.

 •   The combination of this field identifiers and its type specifier is used
     to uniquely identify the field.

 •   An a new compiling isn't necessary.

 •   Statically typed systems like CORBA or RMI would require an
     update of all clients in this case.
Forward and Backward Compatibility Case Analysis



  There are four cases in which version mismatches may occur:

   1.   Added field, old client, new server.

   2.   Removed field, old client, new server.

   3.   Added field, new client, old server.

   4.   Removed field, new client, old server.
Forward and Backward Compatibility: Example 1




        BankDepositMsg            BankDepositMsg

      user_id: 123              user_id: 123

      amount: 1000.00           amount: 1000.00

      datestamp: 82912323       datestamp: 82912323




  Producer (client) sends a message to a consumer
    (server). All good.
Forward and Backward Compatibility: Example 2




        BankDepositMsg            BankDepositMsg

      user_id: 123              user_id: 123

      amount: 1000.00           amount: 1000.00

      datestamp: 82912323       datestamp: 82912323

                                branch_id: None


  Producer (old client) sends an old message to a
    consumer (new server). The new server recognizes
    that the field is not set, and implements default
    behavior for out-of-date requests… Still good
Forward and Backward Compatibility: Example 3




        BankDepositMsg              BankDepositMsg

      user_id: 123                user_id: 123

      amount: 1000.00             amount: 1000.00

      datestamp: 82912323         datestamp: 82912323

      branch_id: 1333



  Producer (new client) sends a new message to an
    consumer (old server). The old server simply ignores it
    and processes as normal... Still good
Serialization/deserialization performance are unlikely to be a decisive
factor

                   Thrift                                   Protocol Buffers
                   Richer feature set, but varies from      Fewer features but robust
Features
                   language to language                     implementations
                                                            Compare a protobuf Message
                   It was open sourced by Facebook in April definition to a thrift struct definition
Code Quality and
                   2007 probably to speed up development
Design                                                      Compare the protobuf Java generator to
                   and leverage the community’s efforts.
                                                            the thrift Java generator

                                                            Open mailing list
Open-ness          Apache project                           Code base and issue tracker
                                                            Google still drives development
                   Severely lacking, but catching up
Documentation                                               Excellent documentation
                   Compare the protobuf documentation to
                   the thrift wiki
Projects Using Thrift

 •   Applications, projects, and organizations using Thrift include:

     •   Facebook
     •   Cassandra project
     •   Hadoop supports access to its HDFS API through Thrift bindings
     •   HBase leverages Thrift for a cross-language API
     •   Hypertable leverages Thrift for a cross-language API since v0.9.1.0a
     •   LastFM
     •   DoAT
     •   ThriftDB
     •   Scribe
     •   Evernote uses Thrift for its public API.
     •   Junkdepot
Projects Using Protobuf

 •   Google 

 •   ActiveMQ uses the protobuf for Message store

 •   Netty (protobuf-rpc)

 •   I couldn’t find a complete list of protobuf users anywhere 
Pros & Cons

       Thrift                                        Protocol Buffers

                                                     Slightly faster than Thrift when using
                                                     "optimize_for = SPEED"
       More languages supported out of the box
                                                     Serialized objects slightly smaller than Thrift due
       Richer data structures than Protobuf (e.g.:
Pros   Map and Set)
                                                     to more aggressive data compression

                                                     Better documentation
       Includes RPC implementation for services
                                                     API a bit cleaner than Thrift


       Good examples are hard to find                .proto can define services, but no RPC
Cons                                                 implementation is defined (although stubs are
       Missing/incomplete documentation              generated for you).
I’d choose Protocol Buffers over Thrift, If:


  •   You’re only using Java, C++ or Python.
      •   Experimental support for other languages is being
          developed by third parties but are generally not
          considered ready for production use

  •   You already have an RPC implementation
  •   On-the-wire data size is crucial
  •   The lack of any real documentation is scary to you
I’d choose Thrift over Protocol Buffers, If:


  •   Your language requirements are anything but Java,
      C++ or Python.
  •   You need additional data structures like Map and Set
  •   You want a full client/server RPC implementation built-
      in
  •   You’re a good programmer that doesn’t need
      documentation or examples 
Wait, what about Avro?

 •   Avro is another very recent serialization system.

 •   Avro relies on a schema-based system

 •   When Avro data is read, the schema used when writing it is always present.

 •   Avro data is always serialized with its schema. When Avro data is stored in a file,
     its schema is stored with it, so that files may be processed later by any program.

 •   The schemas are equivalent to protocol buffers proto files, but they do not have to
     be generated.

 •   The JSON format is used to declare the data structures.

 •   Official support for four languages: Java, C, C++, C#, Python, Ruby

 •   An RPC framework.

 •   Apache License 2.0
Avro IDL syntax is butt ugly and error prone


  // Avro IDL:
      { "type": "record",
       "name": "BankDepositMsg",
       "fields" : [
         {"name": "user_id", "type": "int"},
         {"name": "amount", "type": "double", "default": "0.00"},
         {"name": "datestamp", "type": "long"}
       ]
      }

  // Same Thrift IDL:
      struct BankDepositMsg {
        1: required i32 user_id;
        2: required double amount = 0.00;
        3: required i64 datestamp;
      }
Comparison

                         Avro   Thrift and Protocol Buffer

Dynamic schema           Yes    No


Built into Hadoop        Yes    No


Schema in JSON           Yes    No

No need to compile       Yes    No


No need to declare IDs   Yes    No


Bleeding edge            Yes    No

Sexy name               Yes    No
Specification

 •   Schema represented in one of:

     •   JSON string, naming a defined type.

     •   JSON object of the form:

         •   {"type": "typeName" ...attributes...}

     •   JSON array

 •   Primitive types: null, boolean, int, long, float, double, bytes, string
     •   {"type": "string"}

 •   Complex types: records, enums, arrays, maps, unions, fixed
Comparison with other systems

 •   Avro provides functionality similar to systems such as Thrift, Protocol
     Buffers, etc.

 •   Dynamic typing: Avro does not require that code be generated. Data is
     always accompanied by a schema that permits full processing of that
     data without code generation, static datatypes, etc.

 •   Untagged data: Since the schema is present when data is read,
     considerably less type information need be encoded with data, resulting
     in smaller serialization size.

 •   No manually-assigned field IDs: When a schema changes, both the old
     and new schema are always present when processing data, so
     differences may be resolved symbolically, using field names.
Avro Hands On Review

 •   Q3 2012, I tested the latest Avro (1.6.3)
 •   It throws you a message incompatible message when
     you change the field name
 •   Serious bug, crashes w/ different versions of message
     (no fw/back compatibility). Emailed avro-dev@...
 •   Documentation is nearly non-existent and no real
     users. Bleeding edge, little support
Q&A

Contenu connexe

Tendances

Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producerconfluent
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Philip Fisher-Ogden
 
Introduction to Kafka with Spring Integration
Introduction to Kafka with Spring IntegrationIntroduction to Kafka with Spring Integration
Introduction to Kafka with Spring IntegrationBorislav Markov
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controllerconfluent
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafkaJiangjie Qin
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignMichael Noll
 
파이썬 생존 안내서 (자막)
파이썬 생존 안내서 (자막)파이썬 생존 안내서 (자막)
파이썬 생존 안내서 (자막)Heungsub Lee
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka IntroductionAmita Mirajkar
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaShiao-An Yuan
 
Schema registry
Schema registrySchema registry
Schema registryWhiteklay
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 

Tendances (20)

Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producer
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014
 
Introduction to Kafka with Spring Integration
Introduction to Kafka with Spring IntegrationIntroduction to Kafka with Spring Integration
Introduction to Kafka with Spring Integration
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Stores
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafka
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Avro
AvroAvro
Avro
 
파이썬 생존 안내서 (자막)
파이썬 생존 안내서 (자막)파이썬 생존 안내서 (자막)
파이썬 생존 안내서 (자막)
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Schema registry
Schema registrySchema registry
Schema registry
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 

Similaire à Thrift vs Protocol Buffers vs Avro - Biased Comparison

C # (C Sharp).pptx
C # (C Sharp).pptxC # (C Sharp).pptx
C # (C Sharp).pptxSnapeSever
 
Understanding Character Encodings
Understanding Character EncodingsUnderstanding Character Encodings
Understanding Character EncodingsMobisoft Infotech
 
Enforcing API Design Rules for High Quality Code Generation
Enforcing API Design Rules for High Quality Code GenerationEnforcing API Design Rules for High Quality Code Generation
Enforcing API Design Rules for High Quality Code GenerationTim Burks
 
Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]LivePerson
 
Programming Languages #devcon2013
Programming Languages #devcon2013Programming Languages #devcon2013
Programming Languages #devcon2013Iván Montes
 
python presntation 2.pptx
python presntation 2.pptxpython presntation 2.pptx
python presntation 2.pptxArpittripathi45
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...Dataconomy Media
 
CHAPTER 5 mechanical engineeringasaaa.pptx
CHAPTER 5 mechanical engineeringasaaa.pptxCHAPTER 5 mechanical engineeringasaaa.pptx
CHAPTER 5 mechanical engineeringasaaa.pptxSadhilAggarwal
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftTalentica Software
 
Programming with Python: Week 1
Programming with Python: Week 1Programming with Python: Week 1
Programming with Python: Week 1Ahmet Bulut
 
Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...IndicThreads
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Lag Sucks! GDC 2012
Lag Sucks! GDC 2012Lag Sucks! GDC 2012
Lag Sucks! GDC 2012realjenius
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development SecuritySam Bowne
 
Python programming language introduction unit
Python programming language introduction unitPython programming language introduction unit
Python programming language introduction unitmichaelaaron25322
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldMilo Yip
 

Similaire à Thrift vs Protocol Buffers vs Avro - Biased Comparison (20)

C # (C Sharp).pptx
C # (C Sharp).pptxC # (C Sharp).pptx
C # (C Sharp).pptx
 
Understanding Character Encodings
Understanding Character EncodingsUnderstanding Character Encodings
Understanding Character Encodings
 
Enforcing API Design Rules for High Quality Code Generation
Enforcing API Design Rules for High Quality Code GenerationEnforcing API Design Rules for High Quality Code Generation
Enforcing API Design Rules for High Quality Code Generation
 
Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]Apache Avro in LivePerson [Hebrew]
Apache Avro in LivePerson [Hebrew]
 
Programming Languages #devcon2013
Programming Languages #devcon2013Programming Languages #devcon2013
Programming Languages #devcon2013
 
python presntation 2.pptx
python presntation 2.pptxpython presntation 2.pptx
python presntation 2.pptx
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
 
Building custom kernels for IPython
Building custom kernels for IPythonBuilding custom kernels for IPython
Building custom kernels for IPython
 
CHAPTER 5 mechanical engineeringasaaa.pptx
CHAPTER 5 mechanical engineeringasaaa.pptxCHAPTER 5 mechanical engineeringasaaa.pptx
CHAPTER 5 mechanical engineeringasaaa.pptx
 
Building scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thriftBuilding scalable and language independent java services using apache thrift
Building scalable and language independent java services using apache thrift
 
I18n
I18nI18n
I18n
 
Programming with Python: Week 1
Programming with Python: Week 1Programming with Python: Week 1
Programming with Python: Week 1
 
Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...Building scalable and language-independent Java services using Apache Thrift ...
Building scalable and language-independent Java services using Apache Thrift ...
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Lag Sucks! GDC 2012
Lag Sucks! GDC 2012Lag Sucks! GDC 2012
Lag Sucks! GDC 2012
 
Python programming 2nd
Python programming 2ndPython programming 2nd
Python programming 2nd
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development Security
 
Python programming language introduction unit
Python programming language introduction unitPython programming language introduction unit
Python programming language introduction unit
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 

Dernier

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Dernier (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Thrift vs Protocol Buffers vs Avro - Biased Comparison

  • 1. PB vs. Thrift vs. Avro Author: Igor Anishchenko Lohika - May, 2012
  • 2. Problem Statement Simple Distributed Architecture serialize deserialize deserialize serialize • Basic questions are: • What kind of protocol to use, and what data to transmit? • Efficient mechanism for storing and exchanging data • What to do with requests on the server side?
  • 3. …and you want to scale your servers... • When you grow beyond a simple architecture, you want.. • flexibility • ability to grow • latency • and of course - you want it to be simple
  • 4. How components talk • Database protocols - fine. • HTTP + maybe JSON/XML on the front - cool.
  • 5. How components talk • Database protocols - fine. • HTTP + maybe JSON/XML on the front - cool. • But most of the times you have internal APIs.
  • 6. Hasn't this been done before? (yes) • SOAP • CORBA • DCOM, COM+ • JSON, Plain Text, XML
  • 7. Should we pick up one of those? (no) • SOAP • XML, XML and more XML. Do we really need to parse so much XML? • CORBA • Amazing idea, horrible execution • Overdesigned and heavyweight • DCOM, COM+ • Embraced mainly in windows client software • HTTP/JSON/XML/Whatever • Okay, proven – hurray! • But lack protocol description. • You have to maintain both client and server code. • You still have to write your own wrapper to the protocol. • XML has high parsing overhead. • (relatively) expensive to process; large due to repeated tags
  • 8. Decision Time? As a developer - what are you looking for? Be patient, I have something for you on the subsequent slides!!
  • 9. High level goals! • Transparent interaction between multiple programming languages • A language and platform neutral way of serializing structured data for use in communications protocols, data storage etc.
  • 10. High level goals! • Transparent interaction between multiple programming languages • A language and platform neutral way of serializing structured data for use in communications protocols, data storage etc. • Maintain Right balance between: • Efficiency (how much time/space?) • Ease and speed of development • Availability of existing libraries and etc..
  • 11. Consideration: Protocol Space {"deposit_money": "12345678"} JSON Binary '0x6d', '0x6f', '0x6e', '0x01', '0xBC614E' '0x65', '0x79', '0x31', '0x32', '0x33', '0x34', '0x35', '0x36', '0x37', '0x38' Binary takes less space. No contest!
  • 12. Consideration: Protocol Time JSON Binary Push down automata No parser needed. The (PDA) parser (LL(1), binary representation IS LR(1)) -- 1 character [as close as to] the lookahead. Then, final machine representation. translation from characters to native types (int, float, etc) Binary is way faster. No contest
  • 13. Consideration: Protocol Ease of Use JSON Binary Brainless to learn Need to manually write Popular code to define message packets (total pain and error prone!!!) or Use a code generator like Thrift (oh noes, I don't want to learn something new!) Json is easier, binary is a pain.
  • 14. Several smart people have attacked this problem over the years and as a result there several good open source alternatives to choose from Here is where Data Interchange Protocols comes in play…
  • 15. Serialization Frameworks XML, JSON, Protocol Buffers, BERT, BSON, Apache Thrift, Message Pack, Etch, Hessian, ICE, Apache Avro, Custom Protocol...
  • 16. SF have some properties in common • Interface Description (IDL) • Performance • Versioning • Binary Format
  • 17. Protocol Buffer • Designed ~2001 because everything else wasn’t that good those days • Production, proprietary in Google from 2001-2008, open-sourced since 2008 • Battle tested, very stable, well trusted • Every time you hit a Google page, you're hitting several services and several PB code • PB is the glue to all Google services • Official support for four languages: C++, Java, Python, and JavaScript • Does have a lot of third-party support for other languages (of highly variable quality) • Current Version - protobuf-2.4.1 • BSD License
  • 18. Apache Thrift • Designed by an X-Googler in 2007 • Developed internally at Facebook, used extensively there • An open Apache project, hosted in Apache's Inkubator. • Aims to be the next-generation PB (e.g. more comprehensive features, more languages) • IDL syntax is slightly cleaner than PB. If you know one, then you know the other • Supports: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages • Offers a stack for RPC calls • Current Version - thrift-0.8.0 • Apache License 2.0
  • 19. Avro • I have a lot to say about Avro towards the end
  • 20. Typical Operation Model • The typical model of Thrift/Protobuf use is • Write down a bunch of struct-like message formats in an IDL- like language. • Run a tool to generate Java/C++/whatever boilerplate code. • Example: thrift --gen java MyProject.thrift • Outputs thousands of lines - but they remain fairly readable in most languages • Link against this boilerplate when you build your application. • DO NOT EDIT!
  • 21. Thrift Principle of Operation
  • 22. Interface Definition Language (IDL) • Web services interfaces are described using the Web Service Definition Language. Like SOAP, WSDL is a XML-based language. • The new frameworks use their own languages, that are not based on XML. • These new languages are very similar to the Interface Definition Language, known from CORBA.
  • 23. Thrift Protobuf namespace java serializers.thrift.media package serializers.protobuf.media; typedef i32 int option java_package = "serializers.protobuf.media"; typedef i64 long option java_outer_classname = "MediaContentHolder"; option optimize_for = SPEED; affects the C++ and Java enum Size { code generators SMALL = 0, LARGE = 1, message Image { } required string uri = 1; //url to the thumbnail enum Player { optional string title = 2; //used in the html JAVA = 0, required int32 width = 3; // of the image FLASH = 1, required int32 height = 4; // of the image } enum Size { SMALL = 0; struct Image { LARGE = 1; 1: string uri, //url to the images } 2: optional string title, required Size size = 5; 3: required int width, } 4: required int height, 5: required Size size, message Media { } required string uri = 1; optional string title = 2; struct Media { required int32 width = 3; 1: string uri, //url to the thumbnail required int32 height = 4; 2: optional string title, repeated string person = 5; 3: required int width, enum Player { 4: required int height, JAVA = 0; 5: required list<string> person, FLASH = 1; 6: required Player player, } 7: optional string copyright, required Player player = 6; } optional string copyright = 7; } struct MediaContent { 1: required list<Image> image, message MediaContent { 2: required Media media, repeated Image image = 1; } required Media media = 2; }
  • 24. Defining IDL Rules • Every field must have a unique, positive integer identifier ("= 1", " = 2" or " 1:", " 2:" ) • Fields may be marked as ’required’ or ’optional’ • structs/messages may contain other structs/messages • You may specify an optional "default" value for a field • Multiple structs/messages can be defined and referred to within the same .thrift/.proto file
  • 25. Tagging • The numbers are there for a reason! • The "= 1", " = 2" or " 1:", " 2:" markers on each element identify the unique "tag" that field uses in the binary encoding. • It is important that these tags do not change on either side • Tags with values in the range 1 through 15 take one byte to encode • Tags in the range 16 through 2047 take two bytes • Reserve the tags 1 through 15 for very frequently occurring message elements
  • 26. Java Example (Thrift example) // this file is BankDeposit.thrift struct BankDepositMsg { 1: required i32 user_id; 2: required double amount = 0.00; 3: required i64 datestamp;} ... import bank_example.BankDepositMsg; ... BankDepositMsg my_transaction = new BankDepositMsg(); my_transaction.setUser_id(123); my_transaction.setAmount(1000.00); my_transaction.setDatestamp(new Timestamp(date.getTime())); ... In Java (and other compiled languages) you have the getters and the setters, so that if the fields and types are erroneously changed the compiler will inform you of the mistake.
  • 27. The Comparison… Thrift Protocol Buffers Composite Type Struct {} Message {} Base Types bool bool byte 32/64-bit integers 16/32/64-bit integers float double double string string byte sequence Containers list<t1>: An ordered list of elements of type t1. No May contain duplicates. set<t1>: An unordered set of unique elements of type t1. map<t1,t2>: A map of strictly unique keys of type t1 to values of type t2. Enumerations Yes Yes Constants Yes No Example: const i32 INT_CONST = 1234; const map<string,string> MAP_CONST = {"hello": "world", "goodnight": "moon"} Exception Yes (exception keyword instead of the struct No Type/Handling keyword.)
  • 28. The Comparison Thrift Protocol Buffers License Apache BSD-style Compiler C++ C++ RPC Interfaces Yes Yes RPC Implementation Yes No (they do have one internally) Composite Type Extensions No Yes Data Versioning Yes Yes
  • 29. Performance • To keep things simple a lot is missing in the new frameworks. • For example the extensibility of XML or the splitting of metadata (header) and payload (body). • Of course the performance depends on the used operating system, programming language and the network. • Size Comparison • Runtime Performance
  • 30. Size Comparison Each write includes one Course object with 5 Person objects, and one Phone object. TBinaryProtocol – not optimized for space efficiency. Faster to process than the text protocol but more difficult to debug. TCompactProtocol – More compact binary format; typically more efficient to process as well Method Size (smaller is better) Thrift — TCompactProtocol 278 (not bad) Thrift — TBinaryProtocol 460 Protocol Buffers 250 (winner!) RMI 905 REST — JSON 559 REST — XML 836
  • 31. Runtime Performance • Test Scenario • Query the list of Course numbers. • Fetch the course for each course number. • This scenario is executed 10,000 times. The tests were run on the following systems: Operating System Ubuntu® CPU Intel® Core™ 2 T5500 @ 1.66 GHz Memory 2GiB Cores 2
  • 33. Runtime Performance Server CPU % Avg. Client CPU % Avg. Time REST — XML 12.00% 80.75% 05:27.45 REST — JSON 20.00% 75.00% 04:44.83 RMI 16.00% 46.50% 02:14.54 Protocol Buffers 30.00% 37.75% 01:19.48 Thrift — TBinaryProtocol 33.00% 21.00% 01:13.65 Thrift — TCompactProtocol 30.00% 22.50% 01:05.12
  • 34. Versioning • The system must be able to support reading of old data, as well as requests from out-of-date clients to new servers, and vice versa. • Versioning in Thrift and Protobuf is implemented via field identifiers. • The combination of this field identifiers and its type specifier is used to uniquely identify the field. • An a new compiling isn't necessary. • Statically typed systems like CORBA or RMI would require an update of all clients in this case.
  • 35. Forward and Backward Compatibility Case Analysis There are four cases in which version mismatches may occur: 1. Added field, old client, new server. 2. Removed field, old client, new server. 3. Added field, new client, old server. 4. Removed field, new client, old server.
  • 36. Forward and Backward Compatibility: Example 1 BankDepositMsg BankDepositMsg user_id: 123 user_id: 123 amount: 1000.00 amount: 1000.00 datestamp: 82912323 datestamp: 82912323 Producer (client) sends a message to a consumer (server). All good.
  • 37. Forward and Backward Compatibility: Example 2 BankDepositMsg BankDepositMsg user_id: 123 user_id: 123 amount: 1000.00 amount: 1000.00 datestamp: 82912323 datestamp: 82912323 branch_id: None Producer (old client) sends an old message to a consumer (new server). The new server recognizes that the field is not set, and implements default behavior for out-of-date requests… Still good
  • 38. Forward and Backward Compatibility: Example 3 BankDepositMsg BankDepositMsg user_id: 123 user_id: 123 amount: 1000.00 amount: 1000.00 datestamp: 82912323 datestamp: 82912323 branch_id: 1333 Producer (new client) sends a new message to an consumer (old server). The old server simply ignores it and processes as normal... Still good
  • 39. Serialization/deserialization performance are unlikely to be a decisive factor Thrift Protocol Buffers Richer feature set, but varies from Fewer features but robust Features language to language implementations Compare a protobuf Message It was open sourced by Facebook in April definition to a thrift struct definition Code Quality and 2007 probably to speed up development Design Compare the protobuf Java generator to and leverage the community’s efforts. the thrift Java generator Open mailing list Open-ness Apache project Code base and issue tracker Google still drives development Severely lacking, but catching up Documentation Excellent documentation Compare the protobuf documentation to the thrift wiki
  • 40. Projects Using Thrift • Applications, projects, and organizations using Thrift include: • Facebook • Cassandra project • Hadoop supports access to its HDFS API through Thrift bindings • HBase leverages Thrift for a cross-language API • Hypertable leverages Thrift for a cross-language API since v0.9.1.0a • LastFM • DoAT • ThriftDB • Scribe • Evernote uses Thrift for its public API. • Junkdepot
  • 41. Projects Using Protobuf • Google  • ActiveMQ uses the protobuf for Message store • Netty (protobuf-rpc) • I couldn’t find a complete list of protobuf users anywhere 
  • 42. Pros & Cons Thrift Protocol Buffers Slightly faster than Thrift when using "optimize_for = SPEED" More languages supported out of the box Serialized objects slightly smaller than Thrift due Richer data structures than Protobuf (e.g.: Pros Map and Set) to more aggressive data compression Better documentation Includes RPC implementation for services API a bit cleaner than Thrift Good examples are hard to find .proto can define services, but no RPC Cons implementation is defined (although stubs are Missing/incomplete documentation generated for you).
  • 43. I’d choose Protocol Buffers over Thrift, If: • You’re only using Java, C++ or Python. • Experimental support for other languages is being developed by third parties but are generally not considered ready for production use • You already have an RPC implementation • On-the-wire data size is crucial • The lack of any real documentation is scary to you
  • 44. I’d choose Thrift over Protocol Buffers, If: • Your language requirements are anything but Java, C++ or Python. • You need additional data structures like Map and Set • You want a full client/server RPC implementation built- in • You’re a good programmer that doesn’t need documentation or examples 
  • 45. Wait, what about Avro? • Avro is another very recent serialization system. • Avro relies on a schema-based system • When Avro data is read, the schema used when writing it is always present. • Avro data is always serialized with its schema. When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. • The schemas are equivalent to protocol buffers proto files, but they do not have to be generated. • The JSON format is used to declare the data structures. • Official support for four languages: Java, C, C++, C#, Python, Ruby • An RPC framework. • Apache License 2.0
  • 46. Avro IDL syntax is butt ugly and error prone // Avro IDL: { "type": "record", "name": "BankDepositMsg", "fields" : [ {"name": "user_id", "type": "int"}, {"name": "amount", "type": "double", "default": "0.00"}, {"name": "datestamp", "type": "long"} ] } // Same Thrift IDL: struct BankDepositMsg { 1: required i32 user_id; 2: required double amount = 0.00; 3: required i64 datestamp; }
  • 47. Comparison Avro Thrift and Protocol Buffer Dynamic schema Yes No Built into Hadoop Yes No Schema in JSON Yes No No need to compile Yes No No need to declare IDs Yes No Bleeding edge Yes No Sexy name  Yes No
  • 48. Specification • Schema represented in one of: • JSON string, naming a defined type. • JSON object of the form: • {"type": "typeName" ...attributes...} • JSON array • Primitive types: null, boolean, int, long, float, double, bytes, string • {"type": "string"} • Complex types: records, enums, arrays, maps, unions, fixed
  • 49. Comparison with other systems • Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size. • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
  • 50. Avro Hands On Review • Q3 2012, I tested the latest Avro (1.6.3) • It throws you a message incompatible message when you change the field name • Serious bug, crashes w/ different versions of message (no fw/back compatibility). Emailed avro-dev@... • Documentation is nearly non-existent and no real users. Bleeding edge, little support
  • 51. Q&A

Notes de l'éditeur

  1. Assigning TagsAs you can see, each field in the message definition has a unique numbered tag. These tags are used to identify your fields in themessage binary format, and should not be changed once your message type is in use. Note that tags with values in the range 1 through 15 take one byte to encode, including the identifying number and the field&apos;s type (you can find out more about this in Protocol Buffer Encoding). Tags in the range 16 through 2047 take two bytes. So you should reserve the tags 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.