3. Overview
• A data serialization system.
• An RPC framework.
• Used for data storage and communication.
• Purpose:
– Provide rich data structures.
– A compact and fast binary data format.
– Simple integration with dynamic languages.
4. Overview
• Avro uses JSON as its Interface Description
Language (IDL).
– To specify data types.
– To specify protocols.
• Review: JavaScript Object Notation (JSON) is a
lightweight, text-based standard for data
interchange.
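As a rough illustration of the two uses above, the sketch below writes one data type and one protocol as JSON and loads them with Python's standard json module. The names User, UserDirectory, lookup and the example.avro namespace are hypothetical and chosen only for this example; the key names ("type", "fields", "protocol", "types", "messages", "request", "response") follow the Avro specification.

```python
import json

# A data type: a record schema written as plain JSON.
# "User" and "example.avro" are made-up names for illustration.
USER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
""")

# A protocol: also plain JSON. It lists the types it uses and the
# messages (request parameters and response type) it offers.
USER_PROTOCOL = {
    "protocol": "UserDirectory",      # hypothetical protocol name
    "namespace": "example.avro",
    "types": [USER_SCHEMA],
    "messages": {
        "lookup": {                   # hypothetical message name
            "request": [{"name": "name", "type": "string"}],
            "response": "User",
        }
    },
}

# Both definitions are ordinary JSON text on disk or on the wire.
protocol_json = json.dumps(USER_PROTOCOL, indent=2)
field_names = [f["name"] for f in USER_SCHEMA["fields"]]
```

Because both definitions are ordinary JSON, any language with a JSON parser can read them without a dedicated IDL compiler.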
5. Why the need for Avro?
• Primarily used in Hadoop, where it provides a standard:
1. Serialization format for persistent data.
2. Wire format for communication:
• among Hadoop nodes.
• from client programs to Hadoop services.
6. Overview
• Avro relies on schemas.
– Schema stored with data.
– Each datum written with no per-value overheads.
• Thus serialization is fast and small.
• Avro in RPC:
– Schema exchange during client-server handshake.
– Differences between the two schemas (same-named, missing, or extra fields) can then be easily resolved.
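To make "no per-value overheads" and field resolution concrete, here is a minimal pure-Python sketch, not the Avro library itself. It assumes only int, long and string fields, encodes them the way the Avro specification describes (zigzag variable-length integers, length-prefixed UTF-8 strings, record fields concatenated in schema order), and then reads the bytes back against a different reader schema. The helper names and the User record are hypothetical.

```python
import json

def encode_long(n: int) -> bytes:
    """Zigzag, then base-128 varint, as Avro does for int and long."""
    z = (n << 1) ^ (n >> 63)          # small magnitudes become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def decode_long(buf: bytes, pos: int):
    """Inverse of encode_long; returns (value, next position)."""
    z, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (z >> 1) ^ -(z & 1), pos
        shift += 7

def encode_record(schema: dict, datum: dict) -> bytes:
    """Write field values back to back: no names, no type tags."""
    out = bytearray()
    for field in schema["fields"]:
        value = datum[field["name"]]
        if field["type"] == "string":
            raw = value.encode("utf-8")
            out += encode_long(len(raw)) + raw
        else:                          # "int" or "long" in this sketch
            out += encode_long(value)
    return bytes(out)

def decode_record(writer: dict, reader: dict, buf: bytes) -> dict:
    """Decode with the writer's schema, then keep what the reader asks for."""
    pos, written = 0, {}
    for field in writer["fields"]:
        if field["type"] == "string":
            size, pos = decode_long(buf, pos)
            written[field["name"]] = buf[pos:pos + size].decode("utf-8")
            pos += size
        else:
            written[field["name"]], pos = decode_long(buf, pos)
    # Simplified resolution: match fields by name, fall back to the reader's
    # default. (Real Avro raises an error if a missing field has no default.)
    return {f["name"]: written.get(f["name"], f.get("default"))
            for f in reader["fields"]}

writer = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}]}
reader = {"type": "record", "name": "User", "fields": [
    {"name": "age", "type": "int"},
    {"name": "email", "type": "string", "default": ""}]}

payload = encode_record(writer, {"name": "Ada", "age": 36})
as_json = json.dumps({"name": "Ada", "age": 36})
resolved = decode_record(writer, reader, payload)
```

In this sketch the payload is 5 bytes, while the same datum as JSON text is several times larger, because field names and types live only in the schema. The reader drops the writer's name field and fills email from its default.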
8. Comparison with other systems
• Avro vs. Protobuf and Thrift.
• A quick note about Thrift:
– Initially developed at Facebook by a Google intern.
– Closer to Google’s protobuf.
9. Comparison with other systems
                 Avro                 Google protobuf               Thrift
Implementation   Hmm..                Cleaner                       Hmm..
Error handling   Complex              Simple                        OK
Extensibility    Hmm..                Richer                        OK
Compatibility    Java, C, C++, C#,    That and much more, such as   About the same as
                 Python and Ruby      Adobe ActionScript,           protobuf
                                      Microsoft Silverlight, etc.
10. Specification
• A schema is represented in JSON as one of:
– JSON string, naming a defined type.
– JSON object of the form:
• {"type": "typeName" ...attributes...}
– JSON array, representing a union of embedded types.
• Primitive types: null, boolean, int, long, float,
double, bytes, string
– {"type": "string"}
• Complex types: records, enums, arrays, maps,
unions, fixed
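The three representations and the type names above can be told apart with a few lines of Python. This is a minimal sketch using only the standard json module; schema_kind is a hypothetical helper written for this slide, not part of any Avro library, and the sample schemas (Suit, md5) are illustrative.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float",
              "double", "bytes", "string"}

def schema_kind(schema) -> str:
    """Classify a parsed Avro schema by its JSON representation."""
    if isinstance(schema, str):        # JSON string: names a type
        return "primitive" if schema in PRIMITIVES else "named type"
    if isinstance(schema, list):       # JSON array: a union
        return "union"
    type_name = schema["type"]         # JSON object: {"type": ..., attributes}
    return "primitive" if type_name in PRIMITIVES else type_name

SAMPLES = [
    '"string"',
    '{"type": "string"}',
    '["null", "string"]',
    '{"type": "array", "items": "long"}',
    '{"type": "map", "values": "int"}',
    '{"type": "enum", "name": "Suit", "symbols": ["SPADES", "HEARTS"]}',
    '{"type": "fixed", "name": "md5", "size": 16}',
]

kinds = [schema_kind(json.loads(text)) for text in SAMPLES]
```

Note that "string" and {"type": "string"} describe the same primitive type; the object form is simply the longer spelling.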