Avro
 Etymology & History
 Sexy Tractors
 Project Drivers & Overview
 Serialization
 RPC
 Hadoop Support
Etymology
 British aircraft manufacturer
 1910-1963
History
 Doug Cutting – Cloudera, Hadoop project founder
 2002 – Nutch
 2004 – Google GFS, MapReduce whitepapers
 2005 – NDFS & MR, Writable & SequenceFile
 2006 – Hadoop split from Nutch, renamed NDFS to HDFS
 2007 – Yahoo gets involved, HBase, Pig, ZooKeeper
 2008 – TeraSort contest winner, Hive, Mahout, Cassandra
 2009 – Oozie, Flume, Hue
History
 Underlying serialization system basically unchanged
 Additional language support and data formats
 Language, data format combinatorial explosion
    C++ JSON to Java BSON
    Python Smile to PHP CSV
 Apr 2009 – Avro proposal
 May 2010 – Top-level project
Sexy Tractors
 Data serialization tools, like tractors, aren’t sexy
 They should be!
 Dollar for dollar storage capacity has increased
  exponentially, doubling every 1.5-2 years
 Throughput of magnetic storage and network has not
  maintained this pace
 Distributed systems are the norm
 Efficient data serialization techniques and tools are
  vital
Project Drivers
 Common data format for serialization and RPC
 Dynamic
 Expressive
 Efficient
 File format
    Well defined
    Standalone
    Splittable & compressed
Biased Comparison
                       CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
Language independent   Yes   Yes        No             Yes           Yes
Expressive             No    Yes        Yes            Yes           Yes
Efficient              No    No         Yes            Yes           Yes
Dynamic                Yes   Yes        No             No            Yes
Standalone             ?     Yes        No             No            Yes
Splittable             ?     ?          Yes            ?             Yes
Project Overview
 Specification based design
 Dynamic implementations
 File format
 Schemas
    Every implementation must support schemas defined in JSON
    IDL often supported
    Evolvable
 First class Hadoop support
Specification Based Design
 Schemas
 Encoding
 Sort order
 Object container files
 Codecs
 Protocol
 Protocol wire format
 Schema resolution
Specification Based Design
 Schemas
    Primitive types
        null, boolean, int, long, float, double, bytes, string
    Complex types
        Records, enums, arrays, maps, unions and fixed
    Named types
        Records, enums, fixed
        Name & namespace
    Aliases
    http://avro.apache.org/docs/current/spec.html#schemas
Schema Example
log-message.avpr

{
    "namespace": "com.emoney",
    "name": "LogMessage",
    "type": "record",
    "fields": [
       {"name": "level", "type": "string", "comment" : "this is ignored"},
       {"name": "message", "type": "string", "description" : "this is the message"},
       {"name": "dateTime", "type": "long"},
       {"name": "exceptionMessage", "type": ["null", "string"]}
    ]
}
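To make the schema concrete, here is a rough sketch of how one LogMessage would be laid out in Avro's binary encoding: fields are written in schema order with no names or tags, longs use zig-zag variable-length coding, strings are length-prefixed UTF-8, and a union is a branch index followed by the value. The class and method names are made up for illustration; this is hand-rolled code, not the Avro library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled sketch of Avro's binary encoding rules; not the Avro library API.
class BinaryEncodingSketch {

    // Zig-zag maps signed values to unsigned ones so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static void writeLong(ByteArrayOutputStream out, long value) {
        long n = (value << 1) ^ (value >> 63);
        // Emit 7 bits per byte, low bits first; the high bit flags "more bytes follow".
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
    }

    // A string is its UTF-8 byte length (encoded as a long) followed by the bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes one LogMessage: fields in schema order, no field names in the data.
    static byte[] encodeLogMessage(String level, String message,
                                   long dateTime, String exceptionMessage) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, level);
        writeString(out, message);
        writeLong(out, dateTime);
        // Union ["null", "string"]: branch index first, then the value (nothing for null).
        if (exceptionMessage == null) {
            writeLong(out, 0);
        } else {
            writeLong(out, 1);
            writeString(out, exceptionMessage);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeLogMessage("INFO", "hi", 64L, null);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) hex.append(String.format("%02x ", b));
        System.out.println(hex.toString().trim()); // 08 49 4e 46 4f 04 68 69 80 01 00
    }
}
```

Eleven bytes carry the whole record, which is only possible because the reader is expected to have the writer's schema.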
Specification Based Design
 Encodings
    JSON – for debugging
    Binary
 Sort order
    Efficient sorting by system other than writer
    Sorting binary-encoded data without deserialization
Specification Based Design
 Object container files
    Schema
    Serialized data written to binary-encoded blocks
    Blocks may be compressed
    Synchronization markers
 Codecs
    Null
    Deflate
    Snappy (optional)
    LZO (future)
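The synchronization markers are what make a container file splittable: a reader dropped at an arbitrary byte offset can scan forward to the next 16-byte marker and start decoding at the block that follows. The sketch below is a toy model of that scan only. It ignores the file header and the per-block object count and size, and the marker value, offsets and class name are invented for illustration.

```java
import java.util.Arrays;

// Toy model of sync-marker scanning; not the Avro library API.
class SyncMarkerSketch {
    static final int SYNC_SIZE = 16; // Avro sync markers are 16 bytes

    // Returns the position just past the first sync marker that begins at or
    // after 'from', or -1 if no complete marker is found.
    static int nextBlockStart(byte[] file, byte[] sync, int from) {
        for (int i = Math.max(from, 0); i + SYNC_SIZE <= file.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(file, i, i + SYNC_SIZE), sync)) {
                return i + SYNC_SIZE;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sync = new byte[SYNC_SIZE];
        Arrays.fill(sync, (byte) 0x5A);                  // hypothetical marker; real ones are random per file
        byte[] file = new byte[100];                     // zero-filled stand-in for block data
        System.arraycopy(sync, 0, file, 30, SYNC_SIZE);  // first marker occupies bytes 30..45
        System.arraycopy(sync, 0, file, 70, SYNC_SIZE);  // second marker occupies bytes 70..85

        // A reader handed the split starting at byte 40 skips ahead to the next marker.
        System.out.println(nextBlockStart(file, sync, 40)); // prints 86
    }
}
```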
Specification Based Design
 Protocol
    Protocol name
    Namespace
    Types
        Named types used in messages
    Messages
        Uniquely named message
        Request
        Response
        Errors
 Wire format
   Transports
   Framing
   Handshake
Protocol
{
    "namespace": "com.acme",
    "protocol": "HelloWorld",
    "doc": "Protocol Greetings",

    "types": [
       {"name": "Greeting", "type": "record", "fields": [ {"name": "message", "type": "string"}]},
       {"name": "Curse", "type": "error", "fields": [ {"name": "message", "type": "string"}]} ],

    "messages": {
      "hello": {
        "doc": "Say hello.",
        "request": [{"name": "greeting", "type": "Greeting" }],
        "response": "Greeting",
        "errors": ["Curse"]
      }
    }
}
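Below the protocol declaration sits the framing layer listed under wire format: a message travels as a series of buffers, each prefixed with a four-byte big-endian length, and ends with a zero-length buffer. A minimal sketch of that framing follows, assuming non-empty buffers (an empty one would be indistinguishable from the terminator). The class and method names are illustrative and not part of the Avro library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of length-prefixed message framing; not the Avro library API.
class FramingSketch {

    // Writes each buffer as a 4-byte big-endian length plus its data, then a
    // zero-length buffer to mark the end of the message.
    static byte[] frame(List<byte[]> buffers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] buffer : buffers) {
            out.write(ByteBuffer.allocate(4).putInt(buffer.length).array(), 0, 4);
            out.write(buffer, 0, buffer.length);
        }
        out.write(new byte[4], 0, 4); // terminator: a length of 0
        return out.toByteArray();
    }

    // Reads buffers back until the zero-length terminator.
    static List<byte[]> unframe(byte[] framed) {
        ByteBuffer in = ByteBuffer.wrap(framed); // ByteBuffer is big-endian by default
        List<byte[]> buffers = new ArrayList<>();
        for (int length = in.getInt(); length != 0; length = in.getInt()) {
            byte[] buffer = new byte[length];
            in.get(buffer);
            buffers.add(buffer);
        }
        return buffers;
    }

    public static void main(String[] args) {
        List<byte[]> message = new ArrayList<>();
        message.add("hello".getBytes(StandardCharsets.UTF_8));
        message.add("world!".getBytes(StandardCharsets.UTF_8));
        byte[] framed = frame(message);
        System.out.println(framed.length);          // (4 + 5) + (4 + 6) + 4 = 23
        System.out.println(unframe(framed).size()); // 2
    }
}
```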
Schema Resolution & Evolution
 Writer’s schema always provided to reader
 Compare schema used by writer & schema expected by reader
    Fields that match name & type are read
    Fields written that don’t match are skipped
    Expected fields not written can be identified
       Error or provide default value
 Same features as provided by numeric field ids
    Keeps fields symbolic, no index IDs written in data
 Allows for projections
    Very efficient at skipping fields
 Aliases
    Allows projections from 2 different types using aliases
       User transaction: count, date
       Batch: count, date
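The resolution rules above can be sketched with plain maps: walk the fields the reader expects, take the written value when the name matches, fall back to the reader's default when it does not, and fail if there is neither. This is a toy model, not Avro's resolver. It matches on name only (type checks and aliases are left out), and the `host` field and all identifiers are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of name-based schema resolution using plain maps; not the Avro API.
class ResolutionSketch {
    // Marker for "reader field has no default", since null is itself a legal default.
    static final Object NO_DEFAULT = new Object();

    // 'written' holds the decoded writer fields by name; 'readerFields' maps each
    // field the reader expects to its default value (or NO_DEFAULT).
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Object> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerFields.entrySet()) {
            String name = field.getKey();
            if (written.containsKey(name)) {
                result.put(name, written.get(name));   // matched by name: read it
            } else if (field.getValue() != NO_DEFAULT) {
                result.put(name, field.getValue());    // not written: use the default
            } else {
                throw new IllegalStateException("No value or default for field: " + name);
            }
        }
        return result; // writer fields the reader did not ask for are simply skipped
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("level", "INFO");
        written.put("message", "disk full");
        written.put("dateTime", 1300000000000L);

        Map<String, Object> readerFields = new LinkedHashMap<>();
        readerFields.put("message", NO_DEFAULT);
        readerFields.put("host", "unknown"); // hypothetical field added by a newer reader

        System.out.println(resolve(written, readerFields)); // {message=disk full, host=unknown}
    }
}
```

The reader here is a projection (it drops `level` and `dateTime`) and an evolution (it adds `host`) at the same time, and neither needed numeric field ids.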
Implementations
   Core – parse schemas, read & write binary data for a schema
   Data file – read & write Avro data files
   Codec – supported codecs
   RPC/HTTP – make and receive calls over HTTP
Implementation   Core   Data file   Codec             RPC/HTTP
C                Yes    Yes         Deflate           Yes
C++              Yes    Yes         ?                 Yes
C#               Yes    No          N/A               No
Java             Yes    Yes         Deflate, Snappy   Yes
Python           Yes    Yes         Deflate           Yes
Ruby             Yes    Yes         Deflate           Yes
PHP              Yes    Yes         ?                 No
API
 Generic
    Generic attribute/value data structure
    Best suited for dynamic processing
 Specific
    Each record corresponds to a different kind of object in the
     programming language
    RPC systems typically use this
 Reflect
    Schemas generated via reflection
    Converting an existing codebase to use Avro
API
 Low-level
    Schema
    Encoders
    DatumWriter
    DatumReader
 High-level
    DataFileWriter
    DataFileReader
Java Example
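import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Assumes schema.avpr holds the LogMessage schema shown earlier
Schema schema = new Schema.Parser().parse(getClass().getResourceAsStream("schema.avpr"));
OutputStream outputStream = new FileOutputStream("data.avro");

DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
writer.setCodec(CodecFactory.deflateCodec(1)); // must be set before create()
writer.create(schema, outputStream);

GenericRecord message = new GenericData.Record(schema); // generic API: fields set by name
message.put("level", "INFO");
message.put("message", "hello");
message.put("dateTime", System.currentTimeMillis());
message.put("exceptionMessage", null);
writer.append(message);

writer.close();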
Java Example
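import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// The writer's schema is read from the file header, so none is passed here
DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
         new File("data.avro"),
         new GenericDatumReader<GenericRecord>());

for (GenericRecord next : reader) {
  System.out.println("next: " + next);
}
reader.close();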
RPC
 Server
    SocketServer (non-standard)
    SaslSocketServer
    HttpServer
    NettyServer
    DatagramServer (non-standard)
 Responder
    Generic
    Reflect
    Specific
 Client
    Corresponding Transceiver for each server
    LocalTransceiver
 Requestor
RPC Server
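import java.io.File;
import java.net.InetSocketAddress;
import org.apache.avro.Protocol;
import org.apache.avro.ipc.SocketServer;
import org.apache.avro.ipc.generic.GenericResponder;

// Load the protocol declaration (types and messages) from disk
Protocol protocol = Protocol.parse(new File("protocol.avpr"));

InetSocketAddress address = new InetSocketAddress("localhost", 33333);

// The responder is called once per incoming message with the decoded request
GenericResponder responder = new GenericResponder(protocol) {
   @Override
   public Object respond(Protocol.Message message, Object request)
   throws Exception {
     ...
   }
};

// Block the calling thread until the server exits
new SocketServer(responder, address).join();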
Hadoop Support
 File writers and readers
 Replacing RPC with Avro
    In Flume already
 Pig support is in
 Splittable
    Set block size when writing
 Tether jobs
    Connector framework for other languages
    Hadoop Pipes
Future
 RPC
    HBase, Cassandra, Hadoop core
 Hive in progress
 Tether jobs
    Actual MapReduce implementations in other languages
Avro
 Dynamic
 Expressive
 Efficient
 Specification based design
 Language implementations are fairly solid
 Serialization or RPC or both
 First class Hadoop support
 Currently 1.5.1
 Sexy tractors
