Introduction to Avro and Integration with Hadoop
What is Avro?
• Avro is a serialization framework developed within Apache's Hadoop
project. It uses JSON to define data types and protocols, and
serializes data in a compact binary format. Its primary use is in Apache
Hadoop, where it provides both a serialization format for persistent
data and a wire format for communication between Hadoop nodes.
• Avro provides a good way to convert unstructured and semi-structured
data into structured data using schemas
Creating your first Avro schema
Schema description:
{
  "name": "User",
  "type": "record",
  "fields": [
    {"name": "FirstName", "type": "string", "doc": "First Name"},
    {"name": "LastName", "type": "string"},
    {"name": "isActive", "type": "boolean", "default": true},
    {"name": "Account", "type": "int", "default": 0}
  ]
}
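As an aside, the same schema could also be kept in a .avsc file and loaded with Avro's Schema.Parser. A minimal sketch, assuming the schema above was saved as user.avsc (the file name is illustrative):

// Minimal sketch: parse the schema above from a file (file name is an assumption).
// Schema and Schema.Parser come from org.apache.avro.
Schema schema = new Schema.Parser().parse(new File("user.avsc"));
System.out.println(schema.toString(true));   // pretty-printed JSON form of the schema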
Avro schema features
1. Primitive types (null, boolean, int, long, float, double, bytes, string)
2. Records
{ "type": "record",
  "name": "LongList",
  "fields": [ {"name": "value", "type": "long"},
              {"name": "description", "type": "string"} ]
}
3. Others (Enums, Arrays, Maps, Unions, Fixed) - an example follows below
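A sketch of the complex types in one schema; the Contact record and its field names are assumptions made up for illustration:

{
  "type": "record",
  "name": "Contact",
  "fields": [
    {"name": "status", "type": {"type": "enum", "name": "Status", "symbols": ["ACTIVE", "INACTIVE"]}},
    {"name": "phones", "type": {"type": "array", "items": "string"}},
    {"name": "props",  "type": {"type": "map", "values": "string"}},
    {"name": "email",  "type": ["null", "string"], "default": null},
    {"name": "md5",    "type": {"type": "fixed", "name": "MD5", "size": 16}}
  ]
}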
How to create Avro record?
String schemaDescription = " { \n"
    + "  \"name\": \"User\", \n"
    + "  \"type\": \"record\",\n"
    + "  \"fields\": [\n"
    + "   {\"name\": \"FirstName\", \"type\": \"string\", \"doc\": \"First Name\"},\n"
    + "   {\"name\": \"LastName\", \"type\": \"string\"},\n"
    + "   {\"name\": \"isActive\", \"type\": \"boolean\", \"default\": true},\n"
    + "   {\"name\": \"Account\", \"type\": \"int\", \"default\": 0} ]\n"
    + "}";
Schema.Parser parser = new Schema.Parser();
Schema s = parser.parse(schemaDescription);
GenericRecordBuilder builder = new GenericRecordBuilder(s);
How to create Avro record? (cont. 2)
1. The first step in creating an Avro record is to define a JSON-based schema.
2. Avro provides a parser that takes an Avro schema string and returns a Schema object.
3. Once the Schema object is created, a GenericRecordBuilder lets us create records
populated with default values.
How to create Avro record? (cont. 3)
GenericRecord r = builder.build();
System.out.println("Record" + r);
r.put("FirstName", "Joe");
r.put("LastName", "Hadoop");
r.put("Account", 12345);
System.out.println("Record" + r);
System.out.println("FirstName:" + r.get("FirstName"));
{"FirstName": null, "LastName": null, "isActive": true, "Account": 0}
{"FirstName": "Joe", "LastName": "Hadoop", "isActive": true, "Account": 12345}
FirstName:Joe
How to create Avro schema dynamically?
String[] fields = {"FirstName", "LastName", "Account"};
Schema s = Schema.createRecord("Ex2", "desc", "namespace", false);
List<Schema.Field> lstFields = new LinkedList<Schema.Field>();
for (String f : fields) {
lstFields.add(new Schema.Field(f,
Schema.create(Schema.Type.STRING),
"doc",
new TextNode("")));
}
s.setFields(lstFields);
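A minimal sketch of using the dynamically built schema; the record values are assumptions chosen for illustration:

// Build a record against the dynamically created schema (values are illustrative).
GenericRecord rec = new GenericData.Record(s);
rec.put("FirstName", "Joe");
rec.put("LastName", "Hadoop");
rec.put("Account", "12345");   // all three fields were declared as strings above
System.out.println(rec);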
How to sort Avro records?
You can also specify which field(s) to order on and in which direction:
Options: ascending, descending, ignore
{
"name" : "isActive",
"type" : "boolean",
"default" : true,
"order" : "ignore"
}, {
"name" : "Account",
"type" : "int",
"default" : 0,
"order" : "descending"
}
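As a sketch of how the declared sort order is used, GenericData can compare two records according to the schema; r1, r2 and schema are assumed to exist from the earlier examples:

// Compare two records using the schema's per-field "order" attributes.
// r1 and r2 are assumed to be GenericRecords built against the same schema.
int cmp = GenericData.get().compare(r1, r2, schema);
if (cmp < 0)      System.out.println("r1 sorts before r2");
else if (cmp > 0) System.out.println("r1 sorts after r2");
else              System.out.println("r1 and r2 compare equal");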
How to write Avro records in a file?
File file = new File("<file-name>");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
for (Record rec : list) {
dataFileWriter.append(rec);
}
dataFileWriter.close();
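Compression (mentioned in the summary) is enabled on the writer before create(). A minimal sketch, assuming the deflate codec; the codec and compression level are illustrative choices:

// Optional: enable block compression before calling create().
// CodecFactory is org.apache.avro.file.CodecFactory; deflate level 9 is an arbitrary choice here.
DataFileWriter<GenericRecord> compressedWriter =
    new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
compressedWriter.setCodec(CodecFactory.deflateCodec(9));
compressedWriter.create(schema, new File("<file-name>"));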
How to read Avro records from a file?
File file = new File("<file-name>");
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader
= new DataFileReader<GenericRecord>(file, reader);
while (dataFileReader.hasNext()) {
Record r = (Record) dataFileReader.next();
System.out.println(r.toString());
}
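Since DataFileReader is Iterable, the same loop can also be written as a for-each; a small sketch equivalent to the hasNext()/next() loop above:

// Equivalent for-each form of the read loop.
for (GenericRecord rec : dataFileReader) {
    System.out.println(rec);
}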
Running MapReduce Jobs on Avro Data
1. Set the input schema on AvroJob based on the schema read from the input path
File file = new File(DATA_PATH);
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
new DataFileReader<GenericRecord>(file, reader);
Schema s = dataFileReader.getSchema();
AvroJob.setInputSchema(job, s);
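A hedged sketch of the rest of the job wiring with the old org.apache.avro.mapred API, assuming the MapImpl and ReduceImpl classes shown on the next slides; the JobConf, output path, and output schemas are assumptions for illustration:

// Sketch only: wire the job with the old mapred API (JobConf-based).
// Uses org.apache.avro.mapred.{AvroJob, Pair} and org.apache.hadoop.mapred.{JobConf, FileInputFormat, FileOutputFormat, JobClient}.
JobConf job = new JobConf();
AvroJob.setInputSchema(job, s);
AvroJob.setMapOutputSchema(job, Pair.getPairSchema(Schema.create(Schema.Type.STRING), s));  // mapper emits Pair<String, GenericRecord>
AvroJob.setOutputSchema(job, s);                                                            // reducer emits GenericRecord
AvroJob.setMapperClass(job, MapImpl.class);
AvroJob.setReducerClass(job, ReduceImpl.class);
FileInputFormat.setInputPaths(job, new Path(DATA_PATH));
FileOutputFormat.setOutputPath(job, new Path("<output-path>"));
JobClient.runJob(job);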
Running MapReduce Jobs on Avro Data - Mapper
public static class MapImpl extends
AvroMapper<GenericRecord, Pair<String, GenericRecord>> {
public void map( GenericRecord datum,
AvroCollector<Pair<String, GenericRecord>> collector,
Reporter reporter)
throws IOException {
….
}
}
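The original slide elides the map body. Purely as an illustration, a hypothetical implementation might key each record by its FirstName field; this is an assumption, not the author's code:

// Hypothetical body for the elided map() above: key each record by its "FirstName" field.
public void map(GenericRecord datum,
                AvroCollector<Pair<String, GenericRecord>> collector,
                Reporter reporter) throws IOException {
    String key = String.valueOf(datum.get("FirstName"));
    collector.collect(new Pair<String, GenericRecord>(key, datum));
}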
Running MapReduce Jobs on Avro Data - Reducer
public static class ReduceImpl extends
AvroReducer<Utf8, GenericRecord, GenericRecord> {
public void reduce(Utf8 key, Iterable<GenericRecord> values,
AvroCollector< GenericRecord> collector,
Reporter reporter) throws IOException {
collector.collect(values.iterator().next());
return;
}
}
Running Avro MapReduce Jobs on Data with Different Schemas
List<Schema> schemas = new ArrayList<Schema>();
schemas.add(schema1);
schemas.add(schema2);
Schema schema3 = Schema.createUnion(schemas);
This allows data from different sources to be read and processed in the same mapper.
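A minimal sketch of applying the union schema, continuing the job setup from earlier; this line is an assumed continuation, not part of the original slide:

// Use the union schema as the job's input schema so records written with either
// schema1 or schema2 can be consumed by the same mapper.
AvroJob.setInputSchema(job, schema3);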
Summary
• Avro is a great tool to use for semi-structured and structured data
• Simplifies MapReduce development
• Provides a good compression mechanism
• Great tool for conversion from existing SQL code
• Questions?
