guidoschmutz.wordpress.com | @gschmutz
Big Data, Data Lake, Data Serialization Formats
Guido Schmutz
Guido
Working at Trivadis for more than 23 years
Consultant, Trainer, Platform Architect for Java,
Oracle, SOA and Big Data / Fast Data
Oracle Groundbreaker Ambassador & Oracle ACE
Director
@gschmutz guidoschmutz.wordpress.com
210th edition
Agenda
• Introduction
• Avro vs. Protobuf
• Serialization in Big Data, Data Lake & Fast Data
• Protobuf and gRPC
• Summary
https://bit.ly/2zz0CV4
4
Introduction
5
What is Serialization / Deserialization ?
Serialization is the process of turning structured in-memory objects into a byte stream for transmission
over a network or for writing to persistent storage
Deserialization is the reverse process from a byte stream back to a series of structured in-memory
objects
When selecting a data serialization format, the following characteristics should be evaluated:
• Schema support and Schema evolution
• Code generation
• Language support / Interoperability
• Transparent compression
• Splittability
• Support in Big Data / Fast Data Ecosystem
6
Where do we need Serialization / Deserialization ?
[Diagram: serialization/deserialization is needed at every hop of the architecture — a service/client serializes REST calls to a service ({ } API logic), which deserializes them; a streaming source serializes events into an event broker (publish-subscribe); stream analytics deserializes those events and serializes results; integration/data flow pipelines serialize into the data lake, whose raw and refined storage zones are deserialized again for parallel processing.]
7
Sample Data Structure used in this presentation
Person (1.0)
• id : integer
• firstName : text
• lastName : text
• title :
enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• dateOfBirth : date
• addresses : array<Address>
Address (1.0)
• streetAndNr : text
• zipAndCity : text
{
"id":"1",
"firstName":"Peter",
"lastName":"Sample",
"title":"mr",
"emailAddress":"peter.sample@somecorp.com",
"phoneNumber":"+41 79 345 34 44",
"faxNumber":"+41 31 322 33 22",
"dateOfBirth":"1995-11-10",
"addresses":[
{
"id":"1",
"streetAndNr":"Somestreet 10",
"zipAndCity":"9332 Somecity"
}
]
}
https://github.com/gschmutz/various-demos/tree/master/avro-vs-protobuf
8
Avro vs. Protobuf
9
Google Protocol Buffers
• https://developers.google.com/protocol-buffers/
• Protocol buffers (protobuf) are Google's language-neutral, platform-neutral, extensible mechanism
for serializing structured data
• like XML, but smaller, faster, and simpler
• Schema is needed to generate code
and read/write data
• Supports generated code in Java, Python,
Objective-C, C++, Go, Ruby, and C#
• Two different versions: proto2 and proto3
• Presentation based on proto3
• Latest version: 3.13.0
10
Apache Avro
• http://avro.apache.org/docs/current/
• Apache Avro™ is a compact, fast, binary data serialization system invented by the makers of Hadoop
• Avro relies on schemas. When data
is read, the schema used when writing
it is always present
• container file for storing persistent data
• Works both with code generation as well
as in a dynamic manner
• Latest version: 1.10.0
11
Overview
[Diagram: Avro toolchain — a schema file (.avdl, convertible to a .avsc file) feeds a generator producing code for Java, C#, Python, Go, …; serialized data is read and written either through generated classes (specific record) or dynamically without code generation (generic record). Protobuf toolchain — a .proto file feeds the generator, which produces message classes for Java, C#, Python, Go, … that read and write the serialized data.]
12
Defining Schema - IDL (Protobuf)

person-v1.proto:

syntax = "proto3";
package com.trivadis.protobuf.person.v1;

import "address-v1.proto";
import "title-enum-v1.proto";
import "google/protobuf/timestamp.proto";

option java_outer_classname = "PersonWrapper";

message Person {
  int32 id = 1;
  string first_name = 2;
  string last_name = 3;
  com.trivadis.protobuf.lov.Title title = 4;
  string email_address = 5;
  string phone_number = 6;
  string fax_number = 7;
  google.protobuf.Timestamp date_of_birth = 8;
  repeated com.trivadis.protobuf.address.v1.Addresss addresses = 9;
}

title-enum-v1.proto:

syntax = "proto3";
package com.trivadis.protobuf.lov;

enum Title {
  UNKNOWN = 0;
  MR = 1;
  MRS = 2;
  MS = 3;
}

address-v1.proto:

syntax = "proto3";
package com.trivadis.protobuf.address.v1;

message Addresss {
  int32 id = 1;
  string street_and_nr = 2;
  string zip_and_city = 3;
}
https://developers.google.com/protocol-buffers/docs/proto3
13
Defining Schema – JSON
Person-v1.avsc
{
"type" : "record",
"namespace" : "com.trivadis.avro.person.v1",
"name" : "Person",
"description" : "the representation of a person",
"fields" : [
{ "name": "id", "type": "int" },
{ "name": "firstName", "type": "string" },
{ "name": "lastName", "type": "string" },
{ "name" : "title", "type" : {
"type" : "enum", "name" : "TitleEnum",
"symbols" : ["Unknown", "Mr", "Mrs", "Ms"]
}
},
{ "name": "emailAddress", "type": ["null","string"] },
{ "name": "phoneNumber", "type": ["null","string"] },
{ "name": "faxNumber", "type": ["null","string"] },
{ "name": "dateOfBirth", "type": {
"type": "int", "logicalType": "date" } },
{ "name" : "addresses", ... }
]
}
https://avro.apache.org/docs/current/spec.html
14
Defining Schema - IDL (Avro)

Person-v1.avdl:

@namespace("com.trivadis.avro.person.v1")
protocol PersonIdl {
  import idl "Address-v1.avdl";

  enum TitleEnum {
    Unknown, Mr, Mrs, Ms
  }

  record Person {
    int id;
    string firstName;
    string lastName;
    TitleEnum title;
    union { null, string } emailAddress;
    union { null, string } phoneNumber;
    union { null, string } faxNumber;
    date dateOfBirth;
    array<com.trivadis.avro.address.v1.Address> addresses;
  }
}

Address-v1.avdl:

@namespace("com.trivadis.avro.address.v1")
protocol AddressIdl {
  record Address {
    int id;
    string streetAndNr;
    string zipAndCity;
  }
}

Note: the JSON schema (.avsc) can be generated from the IDL schema using Avro Tools.
https://avro.apache.org/docs/current/idl.html
15
Defining Schema - Specification

Protobuf:
• Multiple message types can be defined in a single proto file
• Field numbers: each field in the message has a unique number
  • used to identify the fields in the message binary format
  • should not be changed once the message type is in use
  • field numbers 1–15 need a single byte, 16–2047 need two bytes to encode
• Default values are type-specific

Avro:
• Schema can be represented either as JSON or using the IDL
• Avro specifies two serialization encodings: binary and JSON
• Encoding is done in the order of the fields defined in the record
• The schema used to write the data always needs to be available when the data is read
  • schema can be serialized with the data, or
  • schema is made available through a registry
16
Defining Schema - Data Types

Protobuf:
• Scalar types: double, float, int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64, bool, string, bytes
• Embedded messages
• Enumerations
• Repeated fields

Avro:
• Scalar types: null, int, long, float, double, boolean, string, bytes
• Records
• Maps (string → schema)
• Arrays (schema)
• Enumerations
• Unions
• Logical types
17
Defining Schema - Style Guides

Protobuf:
• Use CamelCase (with an initial capital) for message names
• Use underscore_separated_names for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments for documenting

Avro:
• Use CamelCase (with an initial capital) for record names
• Use camelCase (with an initial lowercase letter) for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments (IDL) or the doc property (JSON) for documenting
18
IDE Support

Protobuf:
• Eclipse: https://marketplace.eclipse.org/content/protobuf-dt
• IntelliJ: https://plugins.jetbrains.com/plugin/8277-protobuf-support

Avro:
• Eclipse: https://marketplace.eclipse.org/content/avroclipse
• IntelliJ: https://plugins.jetbrains.com/plugin/7971-apache-avro-support
19
With Code Generation – Generate the code

Protobuf:
• Run the protocol buffer compiler
• One compiler for all supported languages
• Produces classes for the given language

protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/person-v1.proto

Avro:
• Run the specific tool for the given language
• For Java:

java -jar /path/to/avro-tools-1.8.2.jar compile schema Person-v1.avsc .

• For C++:

avrogencpp -i cpx.json -o cpx.hh -n c

• For C#:

Microsoft.Hadoop.Avro.Tools codegen /i:C:\SDK\src\Microsoft.Hadoop.Avro.Tools\SampleJSON\SampleJSONSchema.avsc /o:
20
With Code Generation – Using Maven
• Use protobuf-maven-plugin for
generating code at maven build
• Generates to target/generated-sources
• Scans all project dependencies for .proto files
• protoc has to be installed on machine
• Use avro-maven-plugin for generating
code at maven build
• Generates to target/generated-sources
21
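For illustration, a minimal avro-maven-plugin configuration could look like this (version and directories are assumptions; for .avdl files, the idl-protocol goal is used instead of schema):

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.10.0</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/target/generated-sources/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>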
Using Protobuf and Avro from Java
If you are using Maven, add the following dependencies to your POM:
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.10.0</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>3.13.0</version>
</dependency>
22
With Code Generation – Create an instance

Avro:

addresses.add(Address.newBuilder()
    .setId(1)
    .setStreetAndNr("Somestreet 10")
    .setZipAndCity("9332 Somecity").build());

Person person = Person.newBuilder()
    .setId(1)
    .setFirstName("Peter")
    .setLastName("Muster")
    .setEmailAddress("peter.muster@somecorp.com")
    .setPhoneNumber("+41 79 345 34 44")
    .setFaxNumber("+41 31 322 33 22")
    .setTitle(TitleEnum.Mr)
    .setDateOfBirth(LocalDate.parse("1995-11-10"))
    .setAddresses(addresses).build();

Protobuf:

addresses.add(Addresss.newBuilder()
    .setId(1)
    .setStreetAndNr("Somestreet 10")
    .setZipAndCity("9332 Somecity").build());

Instant time = Instant.parse("1995-11-10T00:00:00.00Z");
Timestamp timestamp = Timestamp.newBuilder()
    .setSeconds(time.getEpochSecond())
    .setNanos(time.getNano()).build();

Person person = Person.newBuilder()
    .setId(1)
    .setFirstName("Peter")
    .setLastName("Muster")
    .setEmailAddress("peter.muster@somecorp.com")
    .setPhoneNumber("+41 79 345 34 34")
    .setFaxNumber("+41 31 322 33 22")
    .setTitle(TitleEnumWrapper.Title.MR)
    .setDateOfBirth(timestamp)
    .addAllAddresses(addresses).build();
23
With Code Generation – Serializing

Avro:

FileOutputStream fos = new FileOutputStream(BIN_FILE_NAME_V1);
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<Person> writer =
    new SpecificDatumWriter<Person>(Person.getClassSchema());
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(person, encoder);
encoder.flush();
out.close();
byte[] serializedBytes = out.toByteArray();
fos.write(serializedBytes);

Protobuf:

FileOutputStream output = new FileOutputStream(BIN_FILE_NAME_V2);
person.writeTo(output);
24
With Code Generation – Deserializing

Avro:

DatumReader<Person> datumReader =
    new SpecificDatumReader<Person>(Person.class);
byte[] bytes = Files.readAllBytes(new File(BIN_FILE_NAME_V1).toPath());
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
Person person = datumReader.read(null, decoder);
System.out.println(person.getFirstName());

Protobuf:

PersonWrapper.Person person =
    PersonWrapper.Person.parseFrom(new FileInputStream(BIN_FILE_NAME_V1));
System.out.println(person.getFirstName());
25
Encoding

Protobuf:
• Field numbers (tags) are used as keys
• Variable-length encoding for int32 and int64
• plus zig-zag encoding for sint32 and sint64

Avro:
• Data is serialized in the field order of the schema
• Variable-length, zig-zag encoding for int and long; fixed length for float and double

Variable-length encoding: a method of serializing integers using one or more bytes
Zig-zag encoding: more efficient for negative numbers
26
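To make the zig-zag step concrete, here is a small self-contained Java sketch (not from the deck) of the int32 variant — it maps signed values to unsigned ones so that small negative numbers also fit into small varints:

public class ZigZag {
    // protobuf-style zig-zag encoding for int32: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static int encode(int n) {
        return (n << 1) ^ (n >> 31);
    }
    // inverse mapping back to the signed value
    static int decode(int n) {
        return (n >>> 1) ^ -(n & 1);
    }
    public static void main(String[] args) {
        System.out.println(encode(-1));           // 1   -> fits in a single varint byte
        System.out.println(encode(64));           // 128 -> needs two varint bytes
        System.out.println(decode(encode(-123))); // -123
    }
}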
Without code generation
final String schemaLoc = "src/main/avro/Person-v1.avsc";
final File schemaFile = new File(schemaLoc);
final Schema schema = new Schema.Parser().parse(schemaFile);
GenericRecord person1 = new GenericData.Record(schema);
person1.put("id", 1);
person1.put("firstName", "Peter");
person1.put("lastName", "Muster");
person1.put("title", "Mr");
person1.put("emailAddress", "peter.muster@somecorp.com");
person1.put("phoneNumber", "+41 79 345 34 44");
person1.put("faxNumber", "+41 31 322 33 22");
person1.put("dateOfBirth", new LocalDate("1995-11-10"));
27
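The slide above stops at building the GenericRecord; a short sketch of serializing and deserializing it without generated classes (reusing schema and person1 from above, standard Avro generic API):

// write the GenericRecord to bytes using the parsed schema
DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(person1, encoder);
encoder.flush();

// read it back, again without any generated classes
DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
System.out.println(result.get("firstName"));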
Serializing to Object Container File

• the file has the schema embedded, and all objects stored in the file must conform to that schema
• objects are stored in blocks that may be compressed

final DatumWriter<Person> datumWriter = new SpecificDatumWriter<>(Person.class);
final DataFileWriter<Person> dataFileWriter = new DataFileWriter<>(datumWriter);

// use snappy compression
dataFileWriter.setCodec(CodecFactory.snappyCodec());
// specify block size (must be configured before create)
dataFileWriter.setSyncInterval(1000);
dataFileWriter.create(persons.get(0).getSchema(),
    new File(CONTAINER_FILE_NAME_V1));

for (Person person : persons) {
    dataFileWriter.append(person);
}
28
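For completeness, a sketch of reading the container file back — since the writer schema travels inside the file, no schema has to be supplied by the reader:

// the embedded writer schema is picked up from the file header automatically
DatumReader<Person> datumReader = new SpecificDatumReader<>(Person.class);
try (DataFileReader<Person> dataFileReader =
         new DataFileReader<>(new File(CONTAINER_FILE_NAME_V1), datumReader)) {
    for (Person p : dataFileReader) {
        System.out.println(p.getFirstName());
    }
}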
Serializing to Object Container File
00000000: 4f62 6a01 0216 6176 726f 2e73 6368 656d 61f8 0d7b 2274 7970 6522 3a22 7265 636f 7264 Obj...avro.schema..{"type":"record
00000022: 222c 226e 616d 6522 3a22 5065 7273 6f6e 222c 226e 616d 6573 7061 6365 223a 2263 6f6d ","name":"Person","namespace":"com
00000044: 2e74 7269 7661 6469 732e 6176 726f 2e70 6572 736f 6e2e 7631 222c 2266 6965 6c64 7322 .trivadis.avro.person.v1","fields"
00000066: 3a5b 7b22 6e61 6d65 223a 2269 6422 2c22 7479 7065 223a 2269 6e74 222c 2264 6f63 223a :[{"name":"id","type":"int","doc":
00000088: 2269 6422 7d2c 7b22 6e61 6d65 223a 2266 6972 7374 4e61 6d65 222c 2274 7970 6522 3a22 "id"},{"name":"firstName","type":"
000000aa: 7374 7269 6e67 222c 2264 6f63 223a 2246 6972 7374 204e 616d 6522 7d2c 7b22 6e61 6d65 string","doc":"First Name"},{"name
000000cc: 223a 226c 6173 744e 616d 6522 2c22 7479 7065 223a 2273 7472 696e 6722 2c22 646f 6322 ":"lastName","type":"string","doc"
000000ee: 3a22 4c61 7374 204e 616d 6522 7d2c 7b22 6e61 6d65 223a 2274 6974 6c65 222c 2274 7970 :"Last Name"},{"name":"title","typ
00000110: 6522 3a7b 2274 7970 6522 3a22 656e 756d 222c 226e 616d 6522 3a22 5469 746c 6545 6e75 e":{"type":"enum","name":"TitleEnu
00000132: 6d22 2c22 646f 6322 3a22 5661 6c69 6420 7469 746c 6573 222c 2273 796d 626f 6c73 223a m","doc":"Valid titles","symbols":
00000154: 5b22 556e 6b6e 6f77 6e22 2c22 4d72 222c 224d 7273 222c 224d 7322 5d7d 2c22 646f 6322 ["Unknown","Mr","Mrs","Ms"]},"doc"
00000176: 3a22 7468 6520 7469 746c 6520 7573 6564 227d 2c7b 226e 616d 6522 3a22 656d 6169 6c41 :"the title used"},{"name":"emailA
00000198: 6464 7265 7373 222c 2274 7970 6522 3a5b 226e 756c 6c22 2c22 7374 7269 6e67 225d 2c22 ddress","type":["null","string"],"
000001ba: 646f 6322 3a22 227d 2c7b 226e 616d 6522 3a22 7068 6f6e 654e 756d 6265 7222 2c22 7479 doc":""},{"name":"phoneNumber","ty
000001dc: 7065 223a 5b22 6e75 6c6c 222c 2273 7472 696e 6722 5d2c 2264 6f63 223a 2222 7d2c 7b22 pe":["null","string"],"doc":""},{"
000001fe: 6e61 6d65 223a 2266 6178 4e75 6d62 6572 222c 2274 7970 6522 3a5b 226e 756c 6c22 2c22 name":"faxNumber","type":["null","
00000220: 7374 7269 6e67 225d 2c22 646f 6322 3a22 227d 2c7b 226e 616d 6522 3a22 6461 7465 4f66 string"],"doc":""},{"name":"dateOf
00000242: 4269 7274 6822 2c22 7479 7065 223a 7b22 7479 7065 223a 2269 6e74 222c 226c 6f67 6963 Birth","type":{"type":"int","logic
00000264: 616c 5479 7065 223a 2264 6174 6522 7d2c 2264 6f63 223a 2244 6174 6520 6f66 2042 6972 alType":"date"},"doc":"Date of Bir
00000286: 7468 227d 2c7b 226e 616d 6522 3a22 6164 6472 6573 7365 7322 2c22 7479 7065 223a 5b22 th"},{"name":"addresses","type":["
000002a8: 6e75 6c6c 222c 7b22 7479 7065 223a 2261 7272 6179 222c 2269 7465 6d73 223a 7b22 7479 null",{"type":"array","items":{"ty
000002ca: 7065 223a 2272 6563 6f72 6422 2c22 6e61 6d65 223a 2241 6464 7265 7373 222c 2266 6965 pe":"record","name":"Address","fie
000002ec: 6c64 7322 3a5b 7b22 6e61 6d65 223a 2269 6422 2c22 7479 7065 223a 2269 6e74 227d 2c7b lds":[{"name":"id","type":"int"},{
0000030e: 226e 616d 6522 3a22 7374 7265 6574 416e 644e 7222 2c22 7479 7065 223a 2273 7472 696e "name":"streetAndNr","type":"strin
00000330: 6722 7d2c 7b22 6e61 6d65 223a 227a 6970 416e 6443 6974 7922 2c22 7479 7065 223a 2273 g"},{"name":"zipAndCity","type":"s
00000352: 7472 696e 6722 7d5d 7d7d 5d7d 5d2c 2264 6573 6372 6970 7469 6f6e 223a 2274 6865 2072 tring"}]}}]}],"description":"the r
00000374: 6570 7265 7365 6e74 6174 696f 6e20 6f66 2061 2070 6572 736f 6e22 7d00 111d 965a be54 epresentation of a person"}....Z.T
00000396: 3682 1242 1863 02c2 982c 12f2 0f02 0a50 6574 6572 0c53 616d 706c 6502 0232 7065 7465 6..B.c...,.....Peter.Sample..2pete
000003b8: 722e 7361 6d70 6c65 4073 6f6d 6563 6f72 702e 636f 6d02 202b 3431 2037 3920 3334 3520 r.sample@somecorp.com. +41 79 345
000003da: 3334 2034 3402 202b 3431 2033 3120 3332 3220 3333 2032 32c8 9301 0202 021a 536f 6d65 34 44. +41 31 322 33 22.......Some
Avro Container File contains a header with the
Avro Schema used when writing the data
Synchronization markers are used between
data blocks to permit efficient splitting of files
29
Schema Evolution

Person (1.0)
• id : integer
• firstName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• dateOfBirth : date
• addresses : array<Address>

Address (1.0)
• streetAndNr : text
• zipAndCity : text

Person (1.1)
• id : integer
• firstName : text
• middleName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• dateOfBirth : date
• addresses : array<Address>

Address (1.0)
• streetAndNr : text
• zipAndCity : text

V1.0 to V1.1:
• Adding middleName
• Removing faxNumber
30
Schema Evolution

person-v1.proto:

message Person {
  int32 id = 1;
  string first_name = 2;
  string last_name = 3;
  com.trivadis.protobuf.lov.Title title = 4;
  string email_address = 5;
  string phone_number = 6;
  string fax_number = 7;
  google.protobuf.Timestamp date_of_birth = 8;
  repeated com.trivadis.protobuf.address.v1.Addresss addresses = 9;
}

person-v1.1.proto:

message Person {
  int32 id = 1;
  string first_name = 2;
  string middle_name = 10;
  string last_name = 3;
  com.trivadis.protobuf.lov.Title title = 4;
  string email_address = 5;
  string phone_number = 6;
  // string fax_number = 7;
  google.protobuf.Timestamp birth_date = 8;
  repeated com.trivadis.protobuf.address.v1.Addresss addresses = 9;
}
31
Schema Evolution (wire format, V1.0 to V1.1)

V1.0 message (field number → value):
1 → 1
2 → Peter
3 → Sample
4 → MR
5 → peter.sample@somecorp.com
6 → +41 79 345 34 44
7 → +41 31 322 33 22
8 → 1995-11-10
9 (address) 1 → 1
9 (address) 2 → Somestreet 10
9 (address) 3 → 9332 Somecity

V1.1 message (field number → value):
1 → 1
2 → Peter
3 → Sample
4 → MR
5 → peter.sample@somecorp.com
6 → +41 79 345 34 44
8 → 1995-11-10
9 (address) 1 → 1
9 (address) 2 → Somestreet 10
9 (address) 3 → 9332 Somecity
10 → Paul

Unknown fields (a V1.1 reader parsing V1.0 data keeps the dropped tag):
7 → +41 31 322 33 22
32
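Since protobuf 3.5, proto3 messages retain such unknown fields instead of silently dropping them. A hedged Java sketch of the behavior (the byte source is hypothetical):

// a reader using the v1.0 generated class parses v1.1 data; tag 10 (middle_name)
// is not in its schema and is kept in the message's UnknownFieldSet
PersonWrapper.Person person =
    PersonWrapper.Person.parseFrom(bytesFromV11Producer); // hypothetical byte[] source
System.out.println(person.getUnknownFields());            // contains field number 10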
Schema Evolution

Person-v1.avsc:

{
  "type" : "record",
  "namespace" : "com.trivadis.avro.person.v1",
  "name" : "Person",
  "description" : "the representation of a person",
  "fields" : [
    { "name": "id", "type": "int" },
    { "name": "firstName", "type": "string" },
    { "name": "lastName", "type": "string" },
    { "name" : "title", "type" : {
        "type" : "enum", "name" : "TitleEnum",
        "symbols" : ["Unknown", "Mr", "Mrs", "Ms"]
      }
    },
    { "name": "emailAddress", "type": ["null","string"] },
    { "name": "phoneNumber", "type": ["null","string"] },
    { "name": "faxNumber", "type": ["null","string"] },
    { "name": "dateOfBirth", "type": {
        "type": "int", "logicalType": "date" } },
    { "name" : "addresses", ... }
  ]
}

Person-v1.1.avsc:

{
  "type" : "record",
  ...
  "fields" : [
    { "name": "id", "type": "int" },
    { "name": "firstName", "type": "string" },
    { "name": "middleName",
      "type": ["null","string"], "default": null },
    { "name": "lastName", "type": "string" },
    { "name" : "title", ... },
    { "name": "emailAddress", "type": ["null","string"] },
    { "name": "phoneNumber", "type": ["null","string"] },
    { "name": "dateOfBirth", "type": {
        "type": "int", "logicalType": "date" } },
    { "name" : "addresses", ... }
  ]
}
33
Schema Evolution (V1.0 to V1.1)

V1.0 record:
id 1
firstName Peter
lastName Sample
title MR
emailAddress peter.sample@somecorp.com
phoneNumber +41 79 345 34 44
faxNumber +41 31 322 33 22
dateOfBirth 1995-11-10
addresses.id 1
addresses.streetAndNr Somestreet 10
addresses.zipAndCity 9332 Somecity

V1.1 record:
id 1
firstName Peter
middleName Paul
lastName Sample
title MR
emailAddress peter.sample@somecorp.com
phoneNumber +41 79 345 34 44
dateOfBirth 1995-11-10
addresses.id 1
addresses.streetAndNr Somestreet 10
addresses.zipAndCity 9332 Somecity
34
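Avro performs this resolution at read time: the writer's and the reader's schema are compared field by field, and fields missing in the data are filled from their defaults. A minimal sketch (file paths and the byte source are assumptions):

// read bytes written with V1.0 using V1.1 as the reader schema
Schema writerSchema = new Schema.Parser().parse(new File("src/main/avro/Person-v1.avsc"));
Schema readerSchema = new Schema.Parser().parse(new File("src/main/avro/Person-v1.1.avsc"));

DatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema, readerSchema);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(v10Bytes, null); // hypothetical byte[]
GenericRecord person = reader.read(null, decoder);

System.out.println(person.get("middleName")); // null - supplied by the default value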
Big Data & Fast Data
35
Avro & Protobuf with Kafka
[Diagram: a source connector publishes messages via Kafka to the Kafka broker; stream processing applications and a sink connector consume them; all (de)serializers look up the Avro/Protobuf schemas in the Schema Registry.]
36
Avro and Kafka – Schema Registry
<plugin>
<groupId>io.confluent</groupId>
<artifactId>kafka-schema-registry-maven-plugin</artifactId>
<version>4.0.0</version>
<configuration>
<schemaRegistryUrls>
<param>http://172.16.10.10:8081</param>
</schemaRegistryUrls>
<subjects>
<person-v1-value>src/main/avro/Person-v1.avsc</person-v1-value>
</subjects>
</configuration>
<goals>
<goal>register</goal>
</goals>
</plugin>
mvn schema-registry:register
curl -X "GET" "http://172.16.10.10:8081/subjects"
37
Avro and Kafka – Producing Avro to Kafka
@Configuration
public class KafkaConfig {
private String bootstrapServers;
private String schemaRegistryURL;
@Bean
public Map<String, Object> producerConfigs() {
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL);
return props;
}
@Bean
public ProducerFactory<String, Person> producerFactory() { .. }
@Bean
public KafkaTemplate<String, Person> kafkaTemplate() {
return new KafkaTemplate<>(producerFactory());
}
@Component
public class PersonEventProducer {
@Autowired
private KafkaTemplate<String, Person> kafkaTemplate;
@Value("${kafka.topic.person}")
String kafkaTopic;
public void produce(Person person) {
kafkaTemplate.send(kafkaTopic, String.valueOf(person.getId()), person);
}
}
38
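On the consuming side, the matching deserializer configuration might look as follows (a sketch; the property names are those of Confluent's KafkaAvroDeserializer):

// consumer configuration for reading Avro with the Schema Registry
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
props.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL);
// return generated Person objects instead of GenericRecord
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);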
Avro and Big Data

• Avro is widely supported by Big Data frameworks: Hadoop MapReduce, Pig, Hive, Sqoop, Apache Spark, …
• The built-in Avro data source for Apache Spark supports using Avro as a source for DataFrames:
https://spark.apache.org/docs/latest/sql-data-sources-avro.html

val personDF = spark.read.format("avro").load("person-v1.avro")
personDF.createOrReplaceTempView("person")
val subPersonDF =
spark.sql("select * from person where firstName like 'G%'")

libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.0.1"
39
There is more! Column-oriented: Apache Parquet and ORC
A logical table can be translated using either
• Row-based layout (Avro, Protobuf, JSON, …)
• Column-oriented layout (Parquet, ORC, …)
Apache Parquet
• collaboration between Twitter and
Cloudera
• Support in Hadoop, Hive, Spark, Apache
NiFi, StreamSets, Apache Pig, …
Apache ORC
• was created by Facebook and
Hortonworks
• Support in Hadoop, Hive, Spark, Apache
NiFi, Apache Pig, Presto, …
A logical table with columns A, B, C and rows (A1, B1, C1), (A2, B2, C2), (A3, B3, C3):
• Row-based layout stores the values row by row: A1 B1 C1 A2 B2 C2 A3 B3 C3
• Column-oriented layout stores the values column by column: A1 A2 A3 B1 B2 B3 C1 C2 C3
40
Parquet and Big Data

• Parquet is widely supported by Big Data frameworks: Hadoop MapReduce, Pig, Hive, Sqoop, Apache Spark, …
• Apache Spark supports Parquet natively as its default DataFrame source:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

val personDF = spark.read.parquet("person-v1.parquet")
personDF.createOrReplaceTempView("person")
val subPersonDF =
spark.sql("select * from person where firstName like 'G%'")
41
Delta Lake - http://delta.io
• Delta Lake is an open source storage layer that brings reliability to data lakes
• Initially part of the Databricks platform, now open-sourced
• Delta Lake provides
• Fully compatible with Apache Spark
• ACID transactions
• Update and Delete on Big Data Storage
• Schema enforcement
• Time Travel (Data versioning)
• Scalable metadata handling
• Open Format (Parquet)
• Unified streaming and batch data processing
• Schema Evolution
• Audit History
• Integration with Presto/Athena/Hive/Amazon Redshift/Snowflake for read
42
Delta Lake - http://delta.io
43
Other Delta Lake Storage Layers
Apache Hudi
• https://hudi.apache.org/
• Ingests & manages storage of large analytical datasets over DFS
Apache Iceberg
• https://iceberg.apache.org
• Open table format for huge analytic datasets
• Adds tables to Presto and Spark that use a high-performance format
44
https://medium.com/@domisj/comparison-of-big-data-storage-layers-delta-vs-apache-hudi-vs-apache-iceberg-part-1-200599645a02
gRPC & Protobuf
45
Protobuf and gRPC
• https://grpc.io/
• Google's high performance, open-source
universal RPC framework
• layering on top of HTTP/2 and using protocol
buffers to define messages
• Support for Java, C#, C++, Python, Go, Ruby,
Node.js, Objective-C, …
Source: https://thenewstack.io
46
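To tie this back to the Person example: a gRPC service is declared in the same .proto files as the messages. A hypothetical sketch (service and request names are assumptions, not from the deck):

syntax = "proto3";

import "person-v1.proto";

// hypothetical request message carrying the person id
message GetPersonRequest {
  int32 id = 1;
}

// hypothetical service returning the Person message defined earlier
service PersonService {
  rpc GetPerson (GetPersonRequest) returns (com.trivadis.protobuf.person.v1.Person);
}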
Summary
47
Serialization / Deserialization
[Diagram (repeat of slide 7): serialization/deserialization at every hop — service/client REST calls, event broker with publish-subscribe, streaming sources, stream analytics, integration/data flow, and the data lake's raw and refined storage with parallel processing.]
48
You are welcome to join us at the Expo area.
We're looking forward to meeting you.
Link to the Expo area:
https://www.vinivia-event-manager.io/e/DOAG/portal/expo/29731
51
Contenu connexe

Tendances

Zero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with NettyZero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with NettyDaniel Bimschas
 
Getting started with DSpace 7 REST API
Getting started with DSpace 7 REST APIGetting started with DSpace 7 REST API
Getting started with DSpace 7 REST API4Science
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptxDori Waldman
 
The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019Amit Banerjee
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMariaDB plc
 
Building Applications with a Graph Database
Building Applications with a Graph DatabaseBuilding Applications with a Graph Database
Building Applications with a Graph DatabaseTobias Lindaaker
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta LakeKnoldus Inc.
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...Ontico
 
Etsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureEtsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureDan McKinley
 
stupid-simple-kubernetes-final.pdf
stupid-simple-kubernetes-final.pdfstupid-simple-kubernetes-final.pdf
stupid-simple-kubernetes-final.pdfDaniloQueirozMota
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Zalando Technology
 

Tendances (20)

Zero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with NettyZero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with Netty
 
Getting started with DSpace 7 REST API
Getting started with DSpace 7 REST APIGetting started with DSpace 7 REST API
Getting started with DSpace 7 REST API
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019The Roadmap for SQL Server 2019
The Roadmap for SQL Server 2019
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at Facebook
 
Building Applications with a Graph Database
Building Applications with a Graph DatabaseBuilding Applications with a Graph Database
Building Applications with a Graph Database
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...
 
Etsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureEtsy Activity Feeds Architecture
Etsy Activity Feeds Architecture
 
stupid-simple-kubernetes-final.pdf
stupid-simple-kubernetes-final.pdfstupid-simple-kubernetes-final.pdf
stupid-simple-kubernetes-final.pdf
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
 

Similaire à Avro vs Protobuf Serialization Formats

Philip Stehlik at TechTalks.ph - Intro to Groovy and Grails
Philip Stehlik at TechTalks.ph - Intro to Groovy and GrailsPhilip Stehlik at TechTalks.ph - Intro to Groovy and Grails
Philip Stehlik at TechTalks.ph - Intro to Groovy and GrailsPhilip Stehlik
 
Managing Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchManaging Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchVic Hargrave
 
introduction to node.js
introduction to node.jsintroduction to node.js
introduction to node.jsorkaplan
 
Future-proof Development for Classic SharePoint
Future-proof Development for Classic SharePointFuture-proof Development for Classic SharePoint
Future-proof Development for Classic SharePointBob German
 
SOLID Programming with Portable Class Libraries
SOLID Programming with Portable Class LibrariesSOLID Programming with Portable Class Libraries
SOLID Programming with Portable Class LibrariesVagif Abilov
 
C # (C Sharp).pptx
C # (C Sharp).pptxC # (C Sharp).pptx
C # (C Sharp).pptxSnapeSever
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
Everything-as-code. A polyglot adventure. #DevoxxPL
Everything-as-code. A polyglot adventure. #DevoxxPLEverything-as-code. A polyglot adventure. #DevoxxPL
Everything-as-code. A polyglot adventure. #DevoxxPLMario-Leander Reimer
 
Everything-as-code - A polyglot adventure
Everything-as-code - A polyglot adventureEverything-as-code - A polyglot adventure
Everything-as-code - A polyglot adventureQAware GmbH
 
OrientDB introduction - NoSQL
OrientDB introduction - NoSQLOrientDB introduction - NoSQL
OrientDB introduction - NoSQLLuca Garulli
 
EWD 3 Training Course Part 1: How Node.js Integrates With Global Storage Data...
EWD 3 Training Course Part 1: How Node.js Integrates With Global Storage Data...EWD 3 Training Course Part 1: How Node.js Integrates With Global Storage Data...
EWD 3 Training Course Part 1: How Node.js Integrates With Global Storage Data...Rob Tweed
 
Scala at Treasure Data
Scala at Treasure DataScala at Treasure Data
Scala at Treasure DataTaro L. Saito
 
Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)
Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)
Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)lennartkats
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Python And The MySQL X DevAPI - PyCaribbean 2019
Python And The MySQL X DevAPI - PyCaribbean 2019Python And The MySQL X DevAPI - PyCaribbean 2019
Python And The MySQL X DevAPI - PyCaribbean 2019Dave Stokes
 
1.6 米嘉 gobuildweb
1.6 米嘉 gobuildweb1.6 米嘉 gobuildweb
1.6 米嘉 gobuildwebLeo Zhou
 
2015. Libre Software Meeting - syslog-ng: from log collection to processing a...
2015. Libre Software Meeting - syslog-ng: from log collection to processing a...2015. Libre Software Meeting - syslog-ng: from log collection to processing a...
2015. Libre Software Meeting - syslog-ng: from log collection to processing a...BalaBit
 

Similaire à Avro vs Protobuf Serialization Formats (20)

Philip Stehlik at TechTalks.ph - Intro to Groovy and Grails
Philip Stehlik at TechTalks.ph - Intro to Groovy and GrailsPhilip Stehlik at TechTalks.ph - Intro to Groovy and Grails
Philip Stehlik at TechTalks.ph - Intro to Groovy and Grails
 
Managing Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchManaging Your Security Logs with Elasticsearch
Managing Your Security Logs with Elasticsearch
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
introduction to node.js
introduction to node.jsintroduction to node.js
introduction to node.js
 
Future-proof Development for Classic SharePoint
Future-proof Development for Classic SharePointFuture-proof Development for Classic SharePoint
Future-proof Development for Classic SharePoint
 
SOLID Programming with Portable Class Libraries
SOLID Programming with Portable Class LibrariesSOLID Programming with Portable Class Libraries
SOLID Programming with Portable Class Libraries
 
C # (C Sharp).pptx
C # (C Sharp).pptxC # (C Sharp).pptx
C # (C Sharp).pptx
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Avro
AvroAvro
Avro
 
Everything-as-code. A polyglot adventure. #DevoxxPL
Everything-as-code. A polyglot adventure. #DevoxxPLEverything-as-code. A polyglot adventure. #DevoxxPL
Everything-as-code. A polyglot adventure. #DevoxxPL
 
Everything-as-code - A polyglot adventure
Everything-as-code - A polyglot adventureEverything-as-code - A polyglot adventure
Everything-as-code - A polyglot adventure
 
OrientDB introduction - NoSQL
OrientDB introduction - NoSQLOrientDB introduction - NoSQL
OrientDB introduction - NoSQL
 
EWD 3 Training Course Part 1: How Node.js Integrates With Global Storage Data...
EWD 3 Training Course Part 1: How Node.js Integrates With Global Storage Data...EWD 3 Training Course Part 1: How Node.js Integrates With Global Storage Data...
EWD 3 Training Course Part 1: How Node.js Integrates With Global Storage Data...
 
Scala at Treasure Data
Scala at Treasure DataScala at Treasure Data
Scala at Treasure Data
 
Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)
Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)
Domain-Specific Languages for Composable Editor Plugins (LDTA 2009)
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Python And The MySQL X DevAPI - PyCaribbean 2019
Python And The MySQL X DevAPI - PyCaribbean 2019Python And The MySQL X DevAPI - PyCaribbean 2019
Python And The MySQL X DevAPI - PyCaribbean 2019
 
1.6 米嘉 gobuildweb
1.6 米嘉 gobuildweb1.6 米嘉 gobuildweb
1.6 米嘉 gobuildweb
 
2015. Libre Software Meeting - syslog-ng: from log collection to processing a...
2015. Libre Software Meeting - syslog-ng: from log collection to processing a...2015. Libre Software Meeting - syslog-ng: from log collection to processing a...
2015. Libre Software Meeting - syslog-ng: from log collection to processing a...
 

Plus de Guido Schmutz

30 Minutes to the Analytics Platform with Infrastructure as Code
30 Minutes to the Analytics Platform with Infrastructure as Code30 Minutes to the Analytics Platform with Infrastructure as Code
30 Minutes to the Analytics Platform with Infrastructure as CodeGuido Schmutz
 
Event Broker (Kafka) in a Modern Data Architecture
Event Broker (Kafka) in a Modern Data ArchitectureEvent Broker (Kafka) in a Modern Data Architecture
Event Broker (Kafka) in a Modern Data ArchitectureGuido Schmutz
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!Guido Schmutz
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Guido Schmutz
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureEvent Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureGuido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaGuido Schmutz
 
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) ArchitectureEvent Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) ArchitectureGuido Schmutz
 
Building Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache KafkaBuilding Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache KafkaGuido Schmutz
 
Location Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache KafkaLocation Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache KafkaGuido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache KafkaSolutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache KafkaGuido Schmutz
 
What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?Guido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaGuido Schmutz
 
Location Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using KafkaLocation Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using KafkaGuido Schmutz
 
Streaming Visualisation
Streaming VisualisationStreaming Visualisation
Streaming VisualisationGuido Schmutz
 
Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?Guido Schmutz
 
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache KafkaSolutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache KafkaGuido Schmutz
 
Fundamentals Big Data and AI Architecture
Fundamentals Big Data and AI ArchitectureFundamentals Big Data and AI Architecture
Fundamentals Big Data and AI ArchitectureGuido Schmutz
 
Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka Guido Schmutz
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming VisualizationGuido Schmutz
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming VisualizationGuido Schmutz
 

Plus de Guido Schmutz (20)

30 Minutes to the Analytics Platform with Infrastructure as Code
30 Minutes to the Analytics Platform with Infrastructure as Code30 Minutes to the Analytics Platform with Infrastructure as Code
30 Minutes to the Analytics Platform with Infrastructure as Code
 
Event Broker (Kafka) in a Modern Data Architecture
Event Broker (Kafka) in a Modern Data ArchitectureEvent Broker (Kafka) in a Modern Data Architecture
Event Broker (Kafka) in a Modern Data Architecture
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureEvent Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data Architecture
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) ArchitectureEvent Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
 
Building Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache KafkaBuilding Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache Kafka
 
Location Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache KafkaLocation Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache Kafka
 
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache KafkaSolutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
 
What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
Location Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using KafkaLocation Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using Kafka
 
Streaming Visualisation
Streaming VisualisationStreaming Visualisation
Streaming Visualisation
 
Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?
 
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache KafkaSolutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
 
Fundamentals Big Data and AI Architecture
Fundamentals Big Data and AI ArchitectureFundamentals Big Data and AI Architecture
Fundamentals Big Data and AI Architecture
 
Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 

Dernier

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
message Person {
  int32 id = 1;
  string first_name = 2;
  string last_name = 3;
  com.trivadis.protobuf.lov.Title title = 4;
  string email_address = 5;
  string phone_number = 6;
  string fax_number = 7;
  google.protobuf.Timestamp date_of_birth = 8;
  repeated com.trivadis.protobuf.address.v1.Addresss addresses = 9;
}

title-v1.proto

syntax = "proto3";
package com.trivadis.protobuf.lov;

enum Title {
  UNKNOWN = 0;
  MR = 1;
  MRS = 2;
  MS = 3;
}

address-v1.proto

syntax = "proto3";
package com.trivadis.protobuf.address.v1;

message Addresss {
  int32 id = 1;
  string street_and_nr = 2;
  string zip_and_city = 3;
}

https://developers.google.com/protocol-buffers/docs/proto3
13

Defining Schema – JSON

Person-v1.avsc

{
  "type" : "record",
  "namespace" : "com.trivadis.avro.person.v1",
  "name" : "Person",
  "description" : "the representation of a person",
  "fields" : [
    { "name": "id", "type": "int" },
    { "name": "firstName", "type": "string" },
    { "name": "lastName", "type": "string" },
    { "name" : "title",
      "type" : { "type" : "enum",
                 "name" : "TitleEnum",
                 "symbols" : ["Unknown", "Mr", "Mrs", "Ms"] } },
    { "name": "emailAddress", "type": ["null","string"] },
    { "name": "phoneNumber", "type": ["null","string"] },
    { "name": "faxNumber", "type": ["null","string"] },
    { "name": "dateOfBirth", "type": { "type": "int", "logicalType": "date" } },
    { "name" : "addresses", ... }
  ]
}

https://avro.apache.org/docs/current/spec.html
14

Defining Schema - IDL

Person-v1.avdl

@namespace("com.trivadis.avro.person.v1")
protocol PersonIdl {
  import idl "Address-v1.avdl";

  enum TitleEnum {
    Unknown, Mr, Ms, Mrs
  }

  record Person {
    int id;
    string firstName;
    string lastName;
    TitleEnum title;
    union { null, string } emailAddress;
    union { null, string } phoneNumber;
    union { null, string } faxNumber;
    date dateOfBirth;
    array<com.trivadis.avro.address.v1.Address> addresses;
  }
}

address-v1.avdl

@namespace("com.trivadis.avro.address.v1")
protocol AddressIdl {
  record Address {
    int id;
    string streetAndNr;
    string zipAndCity;
  }
}

Note: the JSON schema can be generated from the IDL schema using the Avro Tools.

https://avro.apache.org/docs/current/idl.html
15

Defining Schema - Specification

Protobuf:
• Multiple message types can be defined in a single proto file
• Field numbers – each field in the message has a unique number
  • used to identify the fields in the message binary format
  • should not be changed once the message type is in use
  • 1 – 15 use a single byte, 16 – 2047 use two bytes to encode
• Default values are type-specific

Avro:
• Schema can either be represented as JSON or by using the IDL
• Avro specifies two serialization encodings: binary and JSON
• Encoding is done in the order of the fields defined in the record
• The schema used to write the data always needs to be available when the data is read
  • the schema can be serialized with the data, or
  • the schema is made available through a registry

16

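Because field numbers identify fields on the wire, a number (or name) that has been retired must never be reused. A minimal sketch of guarding against this with proto3's standard reserved statement (the message body is illustrative, not part of the demo project):

// person without fax number; reserve the retired field so a later
// edit cannot accidentally reassign number 7 or the old name
message Person {
  reserved 7;              // formerly fax_number
  reserved "fax_number";
  int32 id = 1;
  string first_name = 2;
  string last_name = 3;
}
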
Defining Schema - Data Types

Protobuf:
• Scalar Types
  • double, float, int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64
  • bool
  • string
  • bytes
• Embedded Messages
• Enumerations
• Repeated

Avro:
• Scalar Types
  • null
  • int, long, float, double
  • boolean
  • string
  • bytes
• Records
• Map (string, Schema)
• Arrays (Schema)
• Enumerations
• Union
• Logical Types

17

Defining Schema - Style Guides

Protobuf:
• Use CamelCase (with an initial capital) for message names
• Use underscore_separated_names for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments for documenting

Avro:
• Use CamelCase (with an initial capital) for record names
• Use camelCase for field names
• Use CamelCase (with an initial capital) for enum type names
• Use CAPITALS_WITH_UNDERSCORES for enum value names
• Use Java-style comments (IDL) or the doc property (JSON) for documenting

18

IDE Support

Protobuf:
• Eclipse: https://marketplace.eclipse.org/content/protobuf-dt
• IntelliJ: https://plugins.jetbrains.com/plugin/8277-protobuf-support

Avro:
• Eclipse: https://marketplace.eclipse.org/content/avroclipse
• IntelliJ: https://plugins.jetbrains.com/plugin/7971-apache-avro-support

19

With Code Generation – Generate the Code

Protobuf – run the protocol buffer compiler (one compiler for all supported languages; it produces classes for the given language):

protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/person-v1.proto

Avro – run the specific tool for the given language:

For Java:
java -jar /path/to/avro-tools-1.8.2.jar compile schema Person-v1.avsc .

For C++:
avrogencpp -i cpx.json -o cpx.hh -n c

For C#:
Microsoft.Hadoop.Avro.Tools codegen /i:C:\SDK\src\Microsoft.Hadoop.Avro.Tools\SampleJSON\SampleJSONSchema.avsc /o:

20

With Code Generation – Using Maven

Protobuf:
• Use the protobuf-maven-plugin for generating code at Maven build
• Generates to target/generated-sources
• Scans all project dependencies for .proto files
• protoc has to be installed on the machine

Avro:
• Use the avro-maven-plugin for generating code at Maven build
• Generates to target/generated-sources

21

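A minimal sketch of an avro-maven-plugin configuration (the schema goal is the plugin's standard goal for .avsc files; version and directories are illustrative):

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.10.0</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/target/generated-sources/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
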
Using Protobuf and Avro from Java

If you are using Maven, add the following dependency to your POM:

Avro:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.10.0</version>
</dependency>

Protobuf:

<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>3.13.0</version>
</dependency>

22

With Code Generation – Create an Instance

Avro:

List<Address> addresses = new ArrayList<>();
addresses.add(Address.newBuilder()
    .setId(1)
    .setStreetAndNr("Somestreet 10")
    .setZipAndCity("9332 Somecity").build());

Person person = Person.newBuilder()
    .setId(1)
    .setFirstName("Peter")
    .setLastName("Muster")
    .setEmailAddress("peter.muster@somecorp.com")
    .setPhoneNumber("+41 79 345 34 44")
    .setFaxNumber("+41 31 322 33 22")
    .setTitle(TitleEnum.Mr)
    .setDateOfBirth(LocalDate.parse("1995-11-10"))
    .setAddresses(addresses).build();

Protobuf:

List<Addresss> addresses = new ArrayList<>();
addresses.add(Addresss.newBuilder()
    .setId(1)
    .setStreetAndNr("Somestreet 10")
    .setZipAndCity("9332 Somecity").build());

Instant time = Instant.parse("1995-11-10T00:00:00.00Z");
Timestamp timestamp = Timestamp.newBuilder()
    .setSeconds(time.getEpochSecond())
    .setNanos(time.getNano()).build();

Person person = Person.newBuilder()
    .setId(1)
    .setFirstName("Peter")
    .setLastName("Muster")
    .setEmailAddress("peter.muster@somecorp.com")
    .setPhoneNumber("+41 79 345 34 44")
    .setFaxNumber("+41 31 322 33 22")
    .setTitle(TitleEnumWrapper.Title.MR)
    .setDateOfBirth(timestamp)
    .addAllAddresses(addresses).build();

23

With Code Generation – Serializing

Avro:

FileOutputStream fos = new FileOutputStream(BIN_FILE_NAME_V1);
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<Person> writer = new SpecificDatumWriter<Person>(Person.getClassSchema());

BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(person, encoder);
encoder.flush();
out.close();

byte[] serializedBytes = out.toByteArray();
fos.write(serializedBytes);

Protobuf:

FileOutputStream output = new FileOutputStream(BIN_FILE_NAME_V1);
person.writeTo(output);

24

With Code Generation – Deserializing

Avro:

DatumReader<Person> datumReader = new SpecificDatumReader<Person>(Person.class);
byte[] bytes = Files.readAllBytes(new File(BIN_FILE_NAME_V1).toPath());
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
Person person = datumReader.read(null, decoder);

System.out.println(person.getFirstName());

Protobuf:

PersonWrapper.Person person =
    PersonWrapper.Person.parseFrom(new FileInputStream(BIN_FILE_NAME_V1));

System.out.println(person.getFirstName());

25

Encoding

Protobuf:
• Field positions (tags) are used as keys
• Variable length encoding for int32 and int64
  • plus zig-zag encoding for sint32 and sint64

Avro:
• Data is serialized in the field order of the schema
• Variable length, zig-zag encoding for int and long; fixed length for float and double

Variable length encoding: a method of serializing integers using one or more bytes.
Zig-zag encoding: more efficient for negative numbers.

26

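As a worked example of the zig-zag mapping (0 → 0, -1 → 1, 1 → 2, -2 → 3, …), a minimal sketch using the standard bit-twiddling formulas from the protobuf encoding documentation; the class and method names are illustrative:

public class ZigZag {

    // map signed to unsigned so small negative values get small varints
    static int encode(int n) {
        return (n << 1) ^ (n >> 31);   // arithmetic shift replicates the sign bit
    }

    static int decode(int z) {
        return (z >>> 1) ^ -(z & 1);   // logical shift, then restore the sign
    }

    public static void main(String[] args) {
        System.out.println(encode(-1)); // 1
        System.out.println(encode(1));  // 2
        System.out.println(decode(3));  // -2
    }
}
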
Without Code Generation

final String schemaLoc = "src/main/avro/Person-v1.avsc";
final File schemaFile = new File(schemaLoc);
final Schema schema = new Schema.Parser().parse(schemaFile);

GenericRecord person1 = new GenericData.Record(schema);
person1.put("id", 1);
person1.put("firstName", "Peter");
person1.put("lastName", "Muster");
person1.put("title", "Mr");
person1.put("emailAddress", "peter.muster@somecorp.com");
person1.put("phoneNumber", "+41 79 345 34 44");
person1.put("faxNumber", "+41 31 322 33 22");
person1.put("dateOfBirth", new LocalDate("1995-11-10"));

27

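Writing such a GenericRecord mirrors the specific-record case, except that a GenericDatumWriter is driven entirely by the parsed schema. A minimal sketch, assuming the schema and person1 variables from above:

DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);

// the writer validates and encodes the record against the parsed schema
writer.write(person1, encoder);
encoder.flush();
byte[] serializedBytes = out.toByteArray();
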
Serializing to Object Container File

• The file carries a schema, and all objects stored in the file must be written according to that schema
• Objects are stored in blocks that may be compressed

final DatumWriter<Person> datumWriter = new SpecificDatumWriter<>(Person.class);
final DataFileWriter<Person> dataFileWriter = new DataFileWriter<>(datumWriter);

// use snappy compression
dataFileWriter.setCodec(CodecFactory.snappyCodec());
// specify block size (must be set before the file is created)
dataFileWriter.setSyncInterval(1000);
dataFileWriter.create(persons.get(0).getSchema(), new File(CONTAINER_FILE_NAME_V1));

for (Person person : persons) {
    dataFileWriter.append(person);
}
dataFileWriter.close();

28

Serializing to Object Container File

00000000: 4f62 6a01 0216 6176 726f 2e73 6368 656d 61f8 0d7b 2274 7970 6522 3a22 7265 636f 7264  Obj...avro.schema..{"type":"record
00000022: 222c 226e 616d 6522 3a22 5065 7273 6f6e 222c 226e 616d 6573 7061 6365 223a 2263 6f6d  ","name":"Person","namespace":"com
00000044: 2e74 7269 7661 6469 732e 6176 726f 2e70 6572 736f 6e2e 7631 222c 2266 6965 6c64 7322  .trivadis.avro.person.v1","fields"
00000066: 3a5b 7b22 6e61 6d65 223a 2269 6422 2c22 7479 7065 223a 2269 6e74 222c 2264 6f63 223a  :[{"name":"id","type":"int","doc":
00000088: 2269 6422 7d2c 7b22 6e61 6d65 223a 2266 6972 7374 4e61 6d65 222c 2274 7970 6522 3a22  "id"},{"name":"firstName","type":"
000000aa: 7374 7269 6e67 222c 2264 6f63 223a 2246 6972 7374 204e 616d 6522 7d2c 7b22 6e61 6d65  string","doc":"First Name"},{"name
000000cc: 223a 226c 6173 744e 616d 6522 2c22 7479 7065 223a 2273 7472 696e 6722 2c22 646f 6322  ":"lastName","type":"string","doc"
000000ee: 3a22 4c61 7374 204e 616d 6522 7d2c 7b22 6e61 6d65 223a 2274 6974 6c65 222c 2274 7970  :"Last Name"},{"name":"title","typ
00000110: 6522 3a7b 2274 7970 6522 3a22 656e 756d 222c 226e 616d 6522 3a22 5469 746c 6545 6e75  e":{"type":"enum","name":"TitleEnu
00000132: 6d22 2c22 646f 6322 3a22 5661 6c69 6420 7469 746c 6573 222c 2273 796d 626f 6c73 223a  m","doc":"Valid titles","symbols":
00000154: 5b22 556e 6b6e 6f77 6e22 2c22 4d72 222c 224d 7273 222c 224d 7322 5d7d 2c22 646f 6322  ["Unknown","Mr","Mrs","Ms"]},"doc"
00000176: 3a22 7468 6520 7469 746c 6520 7573 6564 227d 2c7b 226e 616d 6522 3a22 656d 6169 6c41  :"the title used"},{"name":"emailA
00000198: 6464 7265 7373 222c 2274 7970 6522 3a5b 226e 756c 6c22 2c22 7374 7269 6e67 225d 2c22  ddress","type":["null","string"],"
000001ba: 646f 6322 3a22 227d 2c7b 226e 616d 6522 3a22 7068 6f6e 654e 756d 6265 7222 2c22 7479  doc":""},{"name":"phoneNumber","ty
000001dc: 7065 223a 5b22 6e75 6c6c 222c 2273 7472 696e 6722 5d2c 2264 6f63 223a 2222 7d2c 7b22  pe":["null","string"],"doc":""},{"
000001fe: 6e61 6d65 223a 2266 6178 4e75 6d62 6572 222c 2274 7970 6522 3a5b 226e 756c 6c22 2c22  name":"faxNumber","type":["null","
00000220: 7374 7269 6e67 225d 2c22 646f 6322 3a22 227d 2c7b 226e 616d 6522 3a22 6461 7465 4f66  string"],"doc":""},{"name":"dateOf
00000242: 4269 7274 6822 2c22 7479 7065 223a 7b22 7479 7065 223a 2269 6e74 222c 226c 6f67 6963  Birth","type":{"type":"int","logic
00000264: 616c 5479 7065 223a 2264 6174 6522 7d2c 2264 6f63 223a 2244 6174 6520 6f66 2042 6972  alType":"date"},"doc":"Date of Bir
00000286: 7468 227d 2c7b 226e 616d 6522 3a22 6164 6472 6573 7365 7322 2c22 7479 7065 223a 5b22  th"},{"name":"addresses","type":["
000002a8: 6e75 6c6c 222c 7b22 7479 7065 223a 2261 7272 6179 222c 2269 7465 6d73 223a 7b22 7479  null",{"type":"array","items":{"ty
000002ca: 7065 223a 2272 6563 6f72 6422 2c22 6e61 6d65 223a 2241 6464 7265 7373 222c 2266 6965  pe":"record","name":"Address","fie
000002ec: 6c64 7322 3a5b 7b22 6e61 6d65 223a 2269 6422 2c22 7479 7065 223a 2269 6e74 227d 2c7b  lds":[{"name":"id","type":"int"},{
0000030e: 226e 616d 6522 3a22 7374 7265 6574 416e 644e 7222 2c22 7479 7065 223a 2273 7472 696e  "name":"streetAndNr","type":"strin
00000330: 6722 7d2c 7b22 6e61 6d65 223a 227a 6970 416e 6443 6974 7922 2c22 7479 7065 223a 2273  g"},{"name":"zipAndCity","type":"s
00000352: 7472 696e 6722 7d5d 7d7d 5d7d 5d2c 2264 6573 6372 6970 7469 6f6e 223a 2274 6865 2072  tring"}]}}]}],"description":"the r
00000374: 6570 7265 7365 6e74 6174 696f 6e20 6f66 2061 2070 6572 736f 6e22 7d00 111d 965a be54  epresentation of a person"}....Z.T
00000396: 3682 1242 1863 02c2 982c 12f2 0f02 0a50 6574 6572 0c53 616d 706c 6502 0232 7065 7465  6..B.c...,.....Peter.Sample..2pete
000003b8: 722e 7361 6d70 6c65 4073 6f6d 6563 6f72 702e 636f 6d02 202b 3431 2037 3920 3334 3520  r.sample@somecorp.com. +41 79 345
000003da: 3334 2034 3402 202b 3431 2033 3120 3332 3220 3333 2032 32c8 9301 0202 021a 536f 6d65  34 44. +41 31 322 33 22.......Some

The Avro container file contains a header with the Avro schema used when writing the data.
Synchronization markers are used between data blocks to permit efficient splitting of files.

29

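Reading the container file back does not require the schema up front, since it is embedded in the header. A minimal sketch, assuming the constants and generated Person class from the slides above:

DatumReader<Person> datumReader = new SpecificDatumReader<>(Person.class);
DataFileReader<Person> dataFileReader =
    new DataFileReader<>(new File(CONTAINER_FILE_NAME_V1), datumReader);

// iterate over all records; the writer schema comes from the file header
while (dataFileReader.hasNext()) {
    Person person = dataFileReader.next();
    System.out.println(person.getFirstName());
}
dataFileReader.close();
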
Schema Evolution

Person (1.0)
• id : integer
• firstName : text
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text
• dateOfBirth : date
• addresses : array<Address>

Address (1.0)
• streetAndNr : text
• zipAndCity : text

Person (1.1)
• id : integer
• firstName : text
• middleName : text (new)
• lastName : text
• title : enum(unknown,mr,mrs,ms)
• emailAddress : text
• phoneNumber : text
• faxNumber : text (removed)
• addresses : array<Address>

Address (1.0)
• streetAndNr : text
• zipAndCity : text

V1.0 to V1.1
• Adding middleName
• Removing faxNumber

30

Schema Evolution

person-v1.proto

message Person {
  int32 id = 1;
  string first_name = 2;
  string last_name = 3;
  com.trivadis.protobuf.lov.Title title = 4;
  string email_address = 5;
  string phone_number = 6;
  string fax_number = 7;
  google.protobuf.Timestamp date_of_birth = 8;
  repeated com.trivadis.protobuf.address.v1.Addresss addresses = 9;
}

person-v1.1.proto

message Person {
  int32 id = 1;
  string first_name = 2;
  string middle_name = 10;
  string last_name = 3;
  com.trivadis.protobuf.lov.Title title = 4;
  string email_address = 5;
  string phone_number = 6;
  // string fax_number = 7;
  google.protobuf.Timestamp birth_date = 8;
  repeated com.trivadis.protobuf.address.v1.Addresss addresses = 9;
}

31

Schema Evolution

V1.0 record:

1   1
2   Peter
3   Sample
4   MR
5   peter.sample@somecorp.com
6   +41 79 345 34 44
7   +41 31 322 33 22
8   1995-11-10
9   { 1: 1, 2: Somestreet 10, 3: 9332 Somecity }

V1.1 record (V1.0 to V1.1):

1   1
2   Peter
3   Sample
4   MR
5   peter.sample@somecorp.com
6   +41 79 345 34 44
8   1995-11-10
9   { 1: 1, 2: Somestreet 10, 3: 9332 Somecity }
10  Paul

unknown fields:
7   +41 31 322 33 22

32

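A minimal sketch of this behaviour in Java, assuming serializedV10Bytes holds the v1.0 record above and PersonWrapperV11 is the (hypothetically named) outer class generated from person-v1.1.proto: the retired field 7 is not dropped, it is kept in the unknown field set and survives a re-serialization round trip.

PersonWrapperV11.Person person =
    PersonWrapperV11.Person.parseFrom(serializedV10Bytes);

// fax_number (field 7) is no longer part of the v1.1 schema, so the
// parser keeps its bytes as an unknown field instead of discarding them
System.out.println(person.getUnknownFields().hasField(7)); // true
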
Schema Evolution

Person-v1.avsc

{
  "type" : "record",
  "namespace" : "com.trivadis.avro.person.v1",
  "name" : "Person",
  "description" : "the representation of a person",
  "fields" : [
    { "name": "id", "type": "int" },
    { "name": "firstName", "type": "string" },
    { "name": "lastName", "type": "string" },
    { "name" : "title",
      "type" : { "type" : "enum",
                 "name" : "TitleEnum",
                 "symbols" : ["Unknown", "Mr", "Mrs", "Ms"] } },
    { "name": "emailAddress", "type": ["null","string"] },
    { "name": "phoneNumber", "type": ["null","string"] },
    { "name": "faxNumber", "type": ["null","string"] },
    { "name": "dateOfBirth", "type": { "type": "int", "logicalType": "date" } },
    { "name" : "addresses", ... }
  ]
}

Person-v1.1.avsc

{
  "type" : "record",
  ...
  "fields" : [
    { "name": "id", "type": "int" },
    { "name": "firstName", "type": "string" },
    { "name": "middleName", "type": ["null","string"], "default": null },
    { "name": "lastName", "type": "string" },
    { "name" : "title", ... },
    { "name": "emailAddress", "type": ["null","string"] },
    { "name": "phoneNumber", "type": ["null","string"] },
    { "name" : "addresses", ... }
  ]
}

33

Schema Evolution

V1.0:

id                     1
firstName              Peter
lastName               Sample
title                  MR
emailAddress           peter.sample@somecorp.com
phoneNumber            +41 79 345 34 44
faxNumber              +41 31 322 33 22
dateOfBirth            1995-11-10
addresses.id           1
addresses.streetAndNr  Somestreet 10
addresses.zipAndCity   9332 Somecity

V1.1 (V1.0 to V1.1):

id                     1
firstName              Peter
middleName             Paul
lastName               Sample
title                  MR
emailAddress           peter.sample@somecorp.com
phoneNumber            +41 79 345 34 44
dateOfBirth            1995-11-10
addresses.id           1
addresses.streetAndNr  Somestreet 10
addresses.zipAndCity   9332 Somecity

34

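In Avro this resolution is done explicitly by passing both schemas to the reader. A minimal sketch, assuming the two schema files shown above and bytesWrittenWithV10 holding a record serialized with v1.0: the new middleName falls back to its default (null) and the dropped faxNumber is skipped.

Schema writerSchema = new Schema.Parser().parse(new File("src/main/avro/Person-v1.avsc"));
Schema readerSchema = new Schema.Parser().parse(new File("src/main/avro/Person-v1.1.avsc"));

// the reader resolves the writer schema (v1.0) against the reader schema (v1.1)
DatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema, readerSchema);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytesWrittenWithV10, null);
GenericRecord person = reader.read(null, decoder);

System.out.println(person.get("middleName")); // null (the declared default)
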
Big Data & Fast Data

35

Avro & Protobuf with Kafka

(diagram: Source Connector → Kafka Broker → Sink Connector / Stream Processing; all participants serialize and deserialize Avro or Protobuf messages and validate them against a central Schema Registry)

36

Avro and Kafka – Schema Registry

<plugin>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-schema-registry-maven-plugin</artifactId>
  <version>4.0.0</version>
  <configuration>
    <schemaRegistryUrls>
      <param>http://172.16.10.10:8081</param>
    </schemaRegistryUrls>
    <subjects>
      <person-v1-value>src/main/avro/Person-v1.avsc</person-v1-value>
    </subjects>
  </configuration>
  <goals>
    <goal>register</goal>
  </goals>
</plugin>

mvn schema-registry:register

curl -X "GET" "http://172.16.10.10:8081/subjects"

37

Avro and Kafka – Producing Avro to Kafka

@Configuration
public class KafkaConfig {
    private String bootstrapServers;
    private String schemaRegistryURL;

    @Bean
    public Map<String, Object> producerConfigs() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL);
        return props;
    }

    @Bean
    public ProducerFactory<String, Person> producerFactory() { .. }

    @Bean
    public KafkaTemplate<String, Person> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }
}

@Component
public class PersonEventProducer {
    @Autowired
    private KafkaTemplate<String, Person> kafkaTemplate;

    @Value("${kafka.topic.person}")
    String kafkaTopic;

    public void produce(Person person) {
        kafkaTemplate.send(kafkaTopic, person.getId().toString(), person);
    }
}

38

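The consuming side mirrors this configuration. A minimal sketch (the property names are Confluent's KafkaAvroDeserializer settings; the surrounding Spring wiring is omitted):

Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
props.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryURL);
// return the generated Person class instead of GenericRecord
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
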
Avro and Big Data

• Avro is widely supported by Big Data frameworks: Hadoop MapReduce, Pig, Hive, Sqoop, Apache Spark, …
• The spark-avro data source for Apache Spark supports using Avro as a source for DataFrames: https://spark.apache.org/docs/latest/sql-data-sources-avro.html

libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.0.1"

val personDF = spark.read.format("avro").load("person-v1.avro")
personDF.createOrReplaceTempView("person")
val subPersonDF = spark.sql("select * from person where firstName like 'G%'")

39

There is more! Column-oriented: Apache Parquet and ORC

A logical table can be translated using either
• a row-based layout (Avro, Protobuf, JSON, …)
• a column-oriented layout (Parquet, ORC, …)

For a table with columns A, B, C and rows 1..3:
• row-based layout stores:       A1 B1 C1 | A2 B2 C2 | A3 B3 C3
• column-oriented layout stores: A1 A2 A3 | B1 B2 B3 | C1 C2 C3

Apache Parquet
• collaboration between Twitter and Cloudera
• support in Hadoop, Hive, Spark, Apache NiFi, StreamSets, Apache Pig, …

Apache ORC
• was created by Facebook and Hortonworks
• support in Hadoop, Hive, Spark, Apache NiFi, Apache Pig, Presto, …

40

Parquet and Big Data

• Parquet is equally widely supported by Big Data frameworks: Hadoop MapReduce, Pig, Hive, Sqoop, Apache Spark, …
• Apache Spark supports Parquet natively as a source for DataFrames, with no additional dependency:

val personDF = spark.read.parquet("person-v1.parquet")
personDF.createOrReplaceTempView("person")
val subPersonDF = spark.sql("select * from person where firstName like 'G%'")

41

Delta Lake - http://delta.io

• Delta Lake is an open source storage layer that brings reliability to data lakes
• First part of the Databricks platform, now open-sourced
• Delta Lake provides
  • full compatibility with Apache Spark
  • ACID transactions
  • update and delete on Big Data storage
  • schema enforcement
  • time travel (data versioning)
  • scalable metadata handling
  • open format (Parquet)
  • unified streaming and batch data processing
  • schema evolution
  • audit history
  • integration with Presto/Athena/Hive/Amazon Redshift/Snowflake for read

42

Delta Lake - http://delta.io

43

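A minimal sketch of the Delta API in Spark (Scala, assuming the delta-core dependency is on the classpath; the path is illustrative):

// write the DataFrame as a Delta table (Parquet files plus a transaction log)
personDF.write.format("delta").mode("overwrite").save("/data/person-delta")

// read it back, optionally time-travelling to an earlier version
val current = spark.read.format("delta").load("/data/person-delta")
val v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/person-delta")
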
Other Data Lake Storage Layers

Apache Hudi
• https://hudi.apache.org/
• ingests & manages storage of large analytical datasets over DFS

Apache Iceberg
• https://iceberg.apache.org
• open table format for huge analytic datasets
• adds tables to Presto and Spark that use a high-performance format

https://medium.com/@domisj/comparison-of-big-data-storage-layers-delta-vs-apache-hudi-vs-apache-iceberg-part-1-200599645a02

44

Protobuf and gRPC

• https://grpc.io/
• Google's high-performance, open-source universal RPC framework
• layers on top of HTTP/2 and uses protocol buffers to define messages
• support for Java, C#, C++, Python, Go, Ruby, Node.js, Objective-C, …

Source: https://thenewstack.io

46

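In gRPC, the same .proto files also declare the service contract. A hypothetical sketch reusing the Person message from above (the service and request names are illustrative, not part of the demo project):

syntax = "proto3";
package com.trivadis.protobuf.person.v1;

import "person-v1.proto";

// request message for looking a person up by id
message GetPersonRequest {
  int32 id = 1;
}

// the gRPC compiler generates client stubs and a server base class from this
service PersonService {
  rpc GetPerson (GetPersonRequest) returns (Person);
}
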
Serialization / Deserialization

(recap of the overview diagram: Service/Client with REST API, Event Broker publish/subscribe, Data Flow integration, Stream Analytics, Streaming Source, Parallel Processing, and the Raw/Refined storage zones of the Data Lake — with serialize/deserialize steps at every boundary)

48

You are welcome to join us at the Expo area. We're looking forward to meeting you.

Link to the Expo area: https://www.vinivia-event-manager.io/e/DOAG/portal/expo/29731

49
