Choosing an HDFS data storage format: Avro vs. Parquet and more
Stephen O’Sullivan | @steveos
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
OUR PHILOSOPHY
• Prioritize for highest business value when using emerging technology
• Design with outcomes in mind
• Be agile: deliver initial results quickly, then adapt and iterate
• Collaborate constantly with our customers and partners
AGENDA
Introduction
Data formats
How to choose
Schema evolution
Summary
Questions
Introduction
Data formats
DATA FORMATS
• Storage formats
• What they do
Data format
• Storage Format
• Text
• Sequence File
• Avro
• Parquet
• Optimized Row Columnar (ORC)
Text
• More specifically, text = CSV, TSV, JSON records…
• Convenient format to use to exchange with other
applications or scripts that produce or read
delimited files
• Human readable and parsable
• Data storage is bulky and not as efficient to query
• Does not support block compression
Sequence File
• Provides a persistent data structure for binary key-value pairs
• Row based
• Commonly used to transfer data between MapReduce jobs
• Can be used as an archive to pack small files in
Hadoop
• Supports splitting even when the data is
compressed
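The "archive of small files" idea can be sketched in plain Python. This is not the Hadoop SequenceFile format or API; the record layout (two length prefixes, then key and value bytes) and the names `pack` and `unpack` are assumptions made only for illustration. Each small file becomes one binary key-value record, with the file name as the key and the file contents as the value.

```python
import struct

def pack(files):
    """Concatenate {name: content} pairs into one blob of key-value records."""
    out = bytearray()
    for name, content in files.items():
        key = name.encode("utf-8")
        # Hypothetical layout: key length, value length, key bytes, value bytes.
        out += struct.pack(">II", len(key), len(content))
        out += key
        out += content
    return bytes(out)

def unpack(blob):
    """Read the records back into a {name: content} dictionary."""
    files, offset = {}, 0
    while offset < len(blob):
        key_len, value_len = struct.unpack_from(">II", blob, offset)
        offset += 8
        key = blob[offset:offset + key_len].decode("utf-8")
        offset += key_len
        files[key] = blob[offset:offset + value_len]
        offset += value_len
    return files

# Two small files (illustrative names) packed into a single archive blob.
small_files = {"events-001.log": b"login\n", "events-002.log": b"logout\n"}
archive = pack(small_files)
restored = unpack(archive)
```

One large file of many records is friendlier to HDFS than many tiny files, which is the motivation behind the bullet above.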
Avro
• Widely used as a serialization platform
• Row-based, offers a compact and fast binary
format
• Schema is stored in the file, so the data can be
untagged
• Files support block compression and are splittable
• Supports schema evolution
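To illustrate why storing the schema in the file lets the records be untagged, here is a minimal sketch in plain Python. It does not use the Avro library and does not reproduce the Avro binary encoding; the layout (a JSON schema header followed by length-prefixed string values) and the names `write_container` and `read_container` are assumptions for illustration only.

```python
import io
import json
import struct

def write_container(schema, records):
    """Write the schema once, then each record as bare values without field names."""
    buf = io.BytesIO()
    header = json.dumps(schema).encode("utf-8")
    buf.write(struct.pack(">I", len(header)))
    buf.write(header)
    for record in records:
        for field in schema["fields"]:
            value = record[field["name"]].encode("utf-8")
            buf.write(struct.pack(">I", len(value)))
            buf.write(value)
    return buf.getvalue()

def read_container(data):
    """Recover field names from the header and apply them to the untagged values."""
    buf = io.BytesIO(data)
    (header_len,) = struct.unpack(">I", buf.read(4))
    schema = json.loads(buf.read(header_len).decode("utf-8"))
    names = [field["name"] for field in schema["fields"]]
    records = []
    while True:
        prefix = buf.read(4)
        if not prefix:
            break  # end of data
        record = {}
        for index, name in enumerate(names):
            raw_len = prefix if index == 0 else buf.read(4)
            (length,) = struct.unpack(">I", raw_len)
            record[name] = buf.read(length).decode("utf-8")
        records.append(record)
    return schema, records

# A two-field schema, abridged for the sketch.
schema = {"type": "record", "name": "drwho",
          "fields": [{"name": "doctor_actor", "type": "string"},
                     {"name": "episode_title", "type": "string"}]}
rows = [{"doctor_actor": "Pertwee", "episode_title": "Spearhead from Space"},
        {"doctor_actor": "Pertwee", "episode_title": "Terror of the Autons"}]
blob = write_container(schema, rows)
decoded_schema, decoded_rows = read_container(blob)
```

The field name `doctor_actor` appears in the blob only once, in the header, no matter how many records follow.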
Parquet
• Column-oriented binary file format
• Uses the record shredding and assembly algorithm
described in the Dremel paper
• Each data file contains the values for a set of rows
• Efficient in terms of disk I/O when specific columns
need to be queried
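The I/O advantage of a columnar layout can be sketched with plain Python lists. This is a toy model, not the Parquet format: it handles flat records only and skips the record shredding and assembly step entirely. The function names and the "values touched" counter are assumptions standing in for disk reads.

```python
rows = [
    {"episode_no": 51, "doctor_actor": "Pertwee", "planet": "Earth"},
    {"episode_no": 55, "doctor_actor": "Pertwee", "planet": "Earth"},
    {"episode_no": 63, "doctor_actor": "Pertwee", "planet": "Solos"},
]

def to_columns(rows):
    """Pivot row records into one list per column (the columnar layout)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def scan_row_layout(rows, wanted):
    """Row layout: every value of every row is visited to extract one column."""
    touched, out = 0, []
    for row in rows:
        touched += len(row)
        out.append(row[wanted])
    return out, touched

def scan_column_layout(columns, wanted):
    """Column layout: only the values of the requested column are visited."""
    values = columns[wanted]
    return list(values), len(values)

columns = to_columns(rows)
row_result, row_touched = scan_row_layout(rows, "planet")
col_result, col_touched = scan_column_layout(columns, "planet")
```

Both scans return the same values, but the row layout touches 9 values and the column layout touches 3. The gap grows with the number of columns, which matters for wide tables.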
Optimized Row Columnar
• Considered the evolution of the RCFile
• Stores collections of rows and within the collection
the row data is stored in columnar format
• Introduces lightweight indexing that enables
skipping of irrelevant blocks of rows
• Splittable: allows parallel processing of row
collections
• It comes with basic statistics on columns (min, max,
sum, and count)
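How per-collection statistics allow irrelevant rows to be skipped can be sketched as follows. This is a simplified model in plain Python, not the ORC file layout or reader API; `build_collections`, `scan_greater_than`, and the collection size of 3 are assumptions for illustration.

```python
def build_collections(values, size):
    """Split a column into row collections and record basic statistics for each."""
    collections = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        collections.append({
            "rows": chunk,
            "min": min(chunk),
            "max": max(chunk),
            "sum": sum(chunk),
            "count": len(chunk),
        })
    return collections

def scan_greater_than(collections, threshold):
    """Return values above the threshold, skipping collections whose max rules them out."""
    matches, collections_read = [], 0
    for collection in collections:
        if collection["max"] <= threshold:
            continue  # statistics prove no row here can match, so skip the read
        collections_read += 1
        matches.extend(v for v in collection["rows"] if v > threshold)
    return matches, collections_read

# Sample year values used as an illustrative column.
date_to = [1990, 1971, 2472, 1971, 2100, 2900, 1972, 1928, 2540]
collections = build_collections(date_to, size=3)
matches, collections_read = scan_greater_than(collections, 2500)
```

With this data the first collection has a maximum of 2472, so the query for values above 2500 reads only two of the three collections.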
How to choose
HOW TO CHOOSE
• ...for write
• ...for read
How to choose: For write
• Functional Requirements:
• What type of data do you have?
• Is the data format compatible with your
processing and querying tools?
• What are your file sizes?
• Do you have schemas that evolve over time?
How to choose: For write
• Speed Concerns
• Parquet and ORC usually need some
additional parsing to format the data, which
increases the overall write time
• Avro as a data serialization format: works well from
system to system, handles schema evolution (more
on that later)
• Text is bulky and inefficient but easily readable and
parsable
How to choose: For write
[Charts: write time in seconds, narrow and wide datasets, Hortonworks (Hive 0.14)]
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
How to choose: For write
[Charts: write time in seconds, narrow and wide datasets, hive-1.1.0+cdh5.4.2]
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
How to choose: For write
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
[Charts: write time in seconds for Text, Avro, and Parquet, narrow and wide datasets, Spark 1.3]
How to choose: For write
[Charts: file sizes in megabytes for the narrow dataset and for the wide dataset]
How to choose: For write
• Use case
• Avro – Event data that can change over time
• Sequence File – Datasets shared between MR
jobs
• Text – Adding large amounts of data to HDFS
quickly
How to choose: For read
• Types of queries:
• Column-specific queries, or a few groups of
columns -> Use a columnar format like Parquet or
ORC
• Compressing the file, regardless of the format,
improves query speed
• Text is really slow to read
• Parquet and ORC optimize read performance at
the expense of write performance
How to choose: For read
• Set up:
• Narrow dataset:
• 10 million rows, 10 columns
• Wide dataset:
• 4 million rows, 1000 columns
• Compression:
• Snappy, except for Avro, which uses deflate
How to choose: For read
[Chart: query time in seconds, narrow dataset, Hortonworks Hive 0.14.0.2.2.4.2. Queries: Query 2 (5 conditions), Query 3 (10 conditions). Formats: Text, Avro, Parquet, Sequence, ORC]
How to choose: For read
[Chart: query time in seconds, wide dataset, Hortonworks Hive 0.14.0.2.2.4.2. Queries: Query 2 (5 conditions), Query 3 (10 conditions), Query 4 (20 conditions). Formats: Text, Avro, Parquet, Sequence, ORC]
How to choose: For read
[Chart: query time in seconds, narrow dataset, CDH hive-1.1.0+cdh5.4.2. Queries: Query 1 (0 conditions), Query 2 (5 conditions), Query 3 (10 conditions). Formats: Text, Avro, Parquet, Sequence, ORC]
How to choose: For read
[Chart: query time in seconds, wide dataset, CDH hive-1.1.0+cdh5.4.2. Queries: Query 1 (no conditions), Query 2 (5 conditions), Query 3 (10 conditions), Query 4 (20 conditions). Formats: Text, Avro, Parquet, Sequence, ORC]
How to choose: For read
[Chart: query time in seconds, narrow dataset, CDH Impala. Queries: Query 1 (0 conditions), Query 2 (5 conditions), Query 3 (10 conditions). Formats: Text, Avro, Parquet, Sequence]
How to choose: For read
[Chart: query time in seconds, wide dataset, CDH Impala. Queries: Query 1 (0 filters), Query 2 (5 filters), Query 3 (10 filters), Query 4 (20 filters). Formats: Text, Avro, Parquet, Sequence]
How to choose: For read
Ran 4 queries (using Impala)
over 4 million rows (70 GB raw),
and 1000 columns (wide table)
[Chart: query times in seconds for different data formats. Queries: Query 1 (0 filters), Query 2 (5 filters), Query 3 (10 filters), Query 4 (20 filters). Series: Avro uncompressed, Avro Snappy, Avro Deflate, Parquet, Seq uncompressed, Seq Snappy, Text Snappy, Text uncompressed]
How to choose: For read
• Use case
• Avro – Query datasets that have changed over
time
• Parquet – Query a few columns on a wide
table
Schema Evolution
SCHEMA EVOLUTION
• What is schema evolution?
• Data formats that evolve
• Examples
• Use cases
Schema evolution
• What is schema evolution?
• Adding columns
• Renaming columns
• Removing columns
• Why do we need it?
Schema evolution
• Data formats that can evolve
• Avro
• Parquet
• Can only add columns at the end
• ORC
• It’s coming (That’s what they tell me ;) )…
Schema evolution
• Avro Example
• The data – Dr Who episodes
• Original Dr Who & new Dr Who
• http://www.theguardian.com/news/datablog/2010/aug/20/doctor-who-time-travel-information-is-beautiful
• Avro schema for the original Dr Who
{"namespace": "drwho.avro",
"type": "record",
"name": "drwho",
"fields": [
{"name": "doctor_who_season", "type": "string"}, {"name": "doctor_actor", "type": "string"},
{"name": "episode_no", "type": "string"}, {"name": "episode_title", "type": "string"},
{"name": "date_from", "type": "string"}, {"name": "date_to", "type": "string"},
{"name": "estimated", "type": "string"}, {"name": "planet", "type": "string"},
{"name": "sub_location", "type": "string"}, {"name": "main_location", "type": "string"}
]}
Schema evolution
• Avro Example
• Original Dr Who data
doctor_who_season doctor_actor episode_no episode_title date_from date_to estimated planet sub_location main_location
3 Pertwee 51 Spearhead from Space 1970 1990 y Earth England London and other
3 Pertwee 55 Terror of the Autons 1971 1971 y Earth England Luigi Rossini's Circus
3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus
3 Pertwee 59 The Daemons 1971 1971 y Earth England Devil's End; Wiltshire
3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs
3 Pertwee 63 The Mutants 1972 2900 Solos
3 Pertwee 64 The Time Monster -2000 1972 Earth/ Atlantis
3 Pertwee 64 The Time Monster 1972 -2000 Earth/ Atlantis
3 Pertwee 66 Carnival of Monsters 1972 1928 n Indian Ocean; Planet Inter Minor Ocean; alien planet
3 Pertwee 67 Frontier in Space 1972 2540 n Planet Draconia; Orgon Planet alien planets
3 Pertwee 68 Planet of the Daleks 1972 2540 y Planet Spiridon Alien Planet
3 Pertwee 69 The Green Death 2540 1973 y Earth UK Llanfairfach; Wales
3 Pertwee 70 The Time Warrior 1973 1200 n Earth UK Wessex Castle
3 Pertwee 71 Invasion of The Dinosaurs 1200 1974 y Earth UK London
Schema evolution
• Avro Example
• Let's add, rename, and delete some columns
• Avro schema for the new Dr Who
{"namespace": "drwho.avro",
"type": "record",
"name": "drwho",
"fields": [
{"name": "drwho_season", "type": ["null","string"], "aliases": ["doctor_who_season"]},
{"name": "drwho_actor", "type": ["null","string"], "aliases": ["doctor_actor"]},
{"name": "episode_no", "type": ["null","string"]}, {"name": "episode_title", "type": ["null","string"]},
{"name": "date_from", "type": ["null","string"]}, {"name": "date_to", "type": ["null","string"]},
{"name": "estimated", "type": "string"}, {"name": "planet", "type": ["null","string"]},
{"name": "sub_location", "type": ["null","string"]}, {"name": "main_location", "type": ["null","string"]},
{"name": "hd", "type": "string", "default": "no"}
]}
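How a reader schema with aliases and defaults maps an old record onto the new layout can be sketched in plain Python. This is a simplified imitation of the resolution rules, not the Avro library: `resolve` is a hypothetical helper, type checking is skipped, and the reader schema below is an abridged variant that leaves out `estimated` to illustrate removing a column.

```python
def resolve(record, reader_schema):
    """Map a record written with an older schema onto the reader schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        # A reader field matches a written field by its own name or by an alias.
        candidates = [field["name"]] + field.get("aliases", [])
        for name in candidates:
            if name in record:
                resolved[field["name"]] = record[name]
                break
        else:
            # Added field: absent from old data, so the default is used.
            if "default" not in field:
                raise ValueError("no value or default for field " + field["name"])
            resolved[field["name"]] = field["default"]
    # Written fields that the reader schema does not mention are dropped.
    return resolved

# Abridged reader schema (an assumption for this sketch).
reader_schema = {
    "namespace": "drwho.avro", "type": "record", "name": "drwho",
    "fields": [
        {"name": "drwho_season", "type": ["null", "string"], "aliases": ["doctor_who_season"]},
        {"name": "drwho_actor", "type": ["null", "string"], "aliases": ["doctor_actor"]},
        {"name": "episode_no", "type": ["null", "string"]},
        {"name": "episode_title", "type": ["null", "string"]},
        {"name": "hd", "type": "string", "default": "no"},
    ],
}

# A record as it would have been written with the original field names.
old_record = {
    "doctor_who_season": "3", "doctor_actor": "Pertwee", "episode_no": "51",
    "episode_title": "Spearhead from Space", "estimated": "y",
}
new_row = resolve(old_record, reader_schema)
```

The renamed fields are picked up through their aliases, `hd` falls back to its default of "no", and `estimated` is ignored because the reader schema in this sketch does not list it.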
Schema evolution
• Avro Example
• Original & New Dr Who data
drwho_season drwho_actor episode_no episode_title date_from date_to planet sub_location main_location hd
10 Tennant 201 New Earth 2006 5000000023 New Earth New … New York yes
10 Tennant 202 Tooth and claw 2006 1879 Earth Scotland Torchwood house; Near Balmoral yes
10 Tennant 203 school Reunion 2007 2007 Earth England Deffry Vale yes
10 Tennant 204 the Girl in the Fireplace 1727 1744 Earth France Paris yes
3 Pertwee 51 Spearhead from Space 1970 1990 Earth England London and other no
3 Pertwee 55 Terror of the Autons 1971 1971 Earth England Luigi Rossini's Circus no
3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus no
3 Pertwee 59 The Daemons 1971 1971 Earth England Devil's End; Wiltshire no
3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs no
Schema evolution
• Use cases
• New data added to an event stream
• Need to see historic data with new data (and
the schema has changed a lot)
• Business has changed the field/column name
Summary
QUESTIONS?
Yes, we’re hiring!
info@svds.com
THANK YOU
Stephen O’Sullivan
stephen@svds.com
@steveos
Demo code is here:
github.com/silicon-valley-data-science/stampedecon-2015

 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Choosing an HDFS data storage format: Avro vs. Parquet and more - StampedeCon 2015

  • 1. Choosing an HDFS data storage format: Avro vs. Parquet and more Stephen O’Sullivan | @steveos
  • 2. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience
  • 3. OUR PHILOSOPHY
      • Prioritize for highest business value when using emerging technology
      • Design with outcomes in mind
      • Be agile: deliver initial results quickly, then adapt and iterate
      • Collaborate constantly with our customers and partners
  • 4. AGENDA
      • Introduction
      • Data formats
      • How to choose
      • Schema evolution
      • Summary
      • Questions
  • 7. DATA FORMATS
      • Storage formats
      • What they do
  • 8. Data format
      • Storage format
          • Text
          • Sequence File
          • Avro
          • Parquet
          • Optimized Row Columnar (ORC)
  • 9. Text
      • More specifically, text = CSV, TSV, JSON records…
      • Convenient format for exchanging data with other applications or scripts that produce or read delimited files
      • Human readable and parsable
      • Stored data is bulky and not as efficient to query
      • Does not support block compression
  • 10. Sequence File
      • Provides a persistent data structure for binary key-value pairs
      • Row based
      • Commonly used to transfer data between MapReduce jobs
      • Can be used as an archive to pack small files in Hadoop
      • Supports splitting even when the data is compressed
  • 11. Avro
      • Widely used as a serialization platform
      • Row-based, offers a compact and fast binary format
      • The schema is stored in the file, so the data can be untagged
      • Files support block compression and are splittable
      • Supports schema evolution
  • 12. Parquet
      • Column-oriented binary file format
      • Uses the record shredding and assembly algorithm described in the Dremel paper
      • Each data file contains the values for a set of rows
      • Efficient in terms of disk I/O when specific columns need to be queried
  • 13. Optimized Row Columnar
      • Considered the evolution of the RCFile
      • Stores collections of rows; within each collection the row data is stored in columnar format
      • Introduces lightweight indexing that enables skipping of irrelevant blocks of rows
      • Splittable: allows parallel processing of row collections
      • Comes with basic statistics on columns (min, max, sum, and count)
  • 15. HOW TO CHOOSE
      • …for write
      • …for read
  • 16. How to choose: For write
      • Functional requirements:
          • What type of data do you have?
          • Is the data format compatible with your processing and querying tools?
          • What are your file sizes?
          • Do you have schemas that evolve over time?
  • 17. How to choose: For write
      • Speed concerns
          • Parquet and ORC usually need some additional parsing to format the data, which increases the overall read time
          • Avro as a data serialization format: works well from system to system, handles schema evolution (more on that later)
          • Text is bulky and inefficient but easily readable and parsable
  • 18. How to choose: For write
      • [Charts: time in seconds on Hortonworks (Hive 0.14), one panel for the narrow dataset and one for the wide dataset]
      • Narrow: 10 million rows, 10 columns
      • Wide: 4 million rows, 1000 columns
  • 19. How to choose: For write
      • [Charts: time in seconds on hive-1.1.0+cdh5.4.2, one panel for the narrow dataset and one for the wide dataset]
      • Narrow: 10 million rows, 10 columns
      • Wide: 4 million rows, 1000 columns
  • 20. How to choose: For write
      • [Charts: time in seconds on Spark 1.3 for Text, Avro, and Parquet, one panel for the narrow dataset and one for the wide dataset]
      • Narrow: 10 million rows, 10 columns
      • Wide: 4 million rows, 1000 columns
  • 21. How to choose: For write
      • [Charts: file sizes in megabytes, one panel for the narrow dataset and one for the wide dataset]
  • 22. How to choose: For write
      • Use case
          • Avro: event data that can change over time
          • Sequence File: datasets shared between MR jobs
          • Text: adding large amounts of data to HDFS quickly
  • 23. How to choose: For read
      • Types of queries:
          • Column-specific queries, or queries over a few groups of columns -> use a columnar format like Parquet or ORC
      • Compressing the file, regardless of the format, increases query speed
      • Text is really slow to read
      • Parquet and ORC optimize read performance at the expense of write performance
  • 24. How to choose: For read
      • Set up:
          • Narrow dataset: 10 million rows, 10 columns
          • Wide dataset: 4 million rows, 1000 columns
          • Compression: Snappy, except for Avro, which uses deflate
  • 25. How to choose: For read
      • [Chart: time in seconds, narrow dataset, Hortonworks Hive 0.14.0.2.2.4.2; Query 2 (5 conditions) and Query 3 (10 conditions); series: Text, Avro, Parquet, Sequence, ORC]
  • 26. How to choose: For read
      • [Chart: time in seconds, wide dataset, Hortonworks Hive 0.14.0.2.2.4.2; Query 2 (5 conditions), Query 3 (10 conditions), and Query 4 (20 conditions); series: Text, Avro, Parquet, Sequence, ORC]
  • 27. How to choose: For read
      • [Chart: time in seconds, narrow dataset, CDH hive-1.1.0+cdh5.4.2; Query 1 (0 conditions), Query 2 (5 conditions), and Query 3 (10 conditions); series: Text, Avro, Parquet, Sequence, ORC]
  • 28. How to choose: For read
      • [Chart: time in seconds, wide dataset, CDH hive-1.1.0+cdh5.4.2; Query 1 (no conditions), Query 2 (5 conditions), Query 3 (10 conditions), and Query 4 (20 conditions); series: Text, Avro, Parquet, Sequence, ORC]
  • 29. How to choose: For read
      • [Chart: time in seconds, narrow dataset, CDH Impala; Query 1 (0 conditions), Query 2 (5 conditions), and Query 3 (10 conditions); series: Text, Avro, Parquet, Sequence]
  • 30. How to choose: For read
      • [Chart: time in seconds, wide dataset, CDH Impala; Query 1 (0 filters), Query 2 (5 filters), Query 3 (10 filters), and Query 4 (20 filters); series: Text, Avro, Parquet, Sequence]
  • 31. How to choose: For read
      • Ran 4 queries (using Impala) over 4 million rows (70 GB raw) and 1000 columns (wide table)
      • [Chart: query times in seconds for different data formats; Query 1 (0 filters), Query 2 (5 filters), Query 3 (10 filters), and Query 4 (20 filters); series: Avro uncompressed, Avro Snappy, Avro Deflate, Parquet, Seq uncompressed, Seq Snappy, Text Snappy, Text uncompressed]
  • 32. How to choose: For read
      • Use case
          • Avro: query datasets that have changed over time
          • Parquet: query a few columns on a wide table
  • 34. SCHEMA EVOLUTION
      • What is schema evolution?
      • Data formats that evolve
      • Examples
      • Use cases
  • 35. Schema evolution
      • What is schema evolution?
          • Adding columns
          • Renaming columns
          • Removing columns
      • Why do we need it?
  • 36. Schema evolution
      • Data formats that can evolve
          • Avro
          • Parquet
              • Can only add columns at the end
          • ORC
              • It’s coming (That’s what they tell me ;) )…
  • 37. Schema evolution
      • Avro example
          • The data: Dr Who episodes
          • Original Dr Who & new Dr Who
          • http://www.theguardian.com/news/datablog/2010/aug/20/doctor-who-time-travel-information-is-beautiful
      • Avro schema for the original Dr Who

        {"namespace": "drwho.avro",
         "type": "record",
         "name": "drwho",
         "fields": [
           {"name": "doctor_who_season", "type": "string"},
           {"name": "doctor_actor", "type": "string"},
           {"name": "episode_no", "type": "string"},
           {"name": "episode_title", "type": "string"},
           {"name": "date_from", "type": "string"},
           {"name": "date_to", "type": "string"},
           {"name": "estimated", "type": "string"},
           {"name": "planet", "type": "string"},
           {"name": "sub_location", "type": "string"},
           {"name": "main_location", "type": "string"}
         ]}
  • 38. Schema evolution
      • Avro example
      • Original Dr Who data

        doctor_who_season doctor_actor episode_no episode_title date_from date_to estimated planet sub_location main_location
        3 Pertwee 51 Spearhead from Space 1970 1990 y Earth England London and other
        3 Pertwee 55 Terror of the Autons 1971 1971 y Earth England Luigi Rossini's Circus
        3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus
        3 Pertwee 59 The Daemons 1971 1971 y Earth England Devil's End; Wiltshire
        3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs
        3 Pertwee 63 The Mutants 1972 2900 Solos
        3 Pertwee 64 The Time Monster -2000 1972 Earth/ Atlantis
        3 Pertwee 64 The Time Monster 1972 -2000 Earth/ Atlantis
        3 Pertwee 66 Carnival of Monsters 1972 1928 n Indian Ocean; Planet Inter Minor Ocean; alien planet
        3 Pertwee 67 Frontier in Space 1972 2540 n Planet Draconia; Orgon Planet alien planets
        3 Pertwee 68 Planet of the Daleks 1972 2540 y Planet Spiridon Alien Planet
        3 Pertwee 69 The Green Death 2540 1973 y Earth UK Llanfairfach; Wales
        3 Pertwee 70 The Time Warrior 1973 1200 n Earth UK Wessex Castle
        3 Pertwee 71 Invasion of The Dinosaurs 1200 1974 y Earth UK London
  • 39. Schema evolution
      • Avro example
      • Let's add, rename, and delete some columns
      • Avro schema for the new Dr Who

        {"namespace": "drwho.avro",
         "type": "record",
         "name": "drwho",
         "fields": [
           {"name": "drwho_season", "type": ["null","string"], "aliases": ["doctor_who_season"]},
           {"name": "drwho_actor", "type": ["null","string"], "aliases": ["doctor_actor"]},
           {"name": "episode_no", "type": ["null","string"]},
           {"name": "episode_title", "type": ["null","string"]},
           {"name": "date_from", "type": ["null","string"]},
           {"name": "date_to", "type": ["null","string"]},
           {"name": "estimated", "type": "string"},
           {"name": "planet", "type": ["null","string"]},
           {"name": "sub_location", "type": ["null","string"]},
           {"name": "main_location", "type": ["null","string"]},
           {"name": "hd", "type": "string", "default": "no"}
         ]}
  • 40. Schema evolution
      • Avro example
      • Original & new Dr Who data

        drwho_season drwho_actor episode_no episode_title date_from date_to planet sub_location main_location hd
        10 Tennant 201 New Earth 2006 5000000023 New Earth New … New York yes
        10 Tennant 202 Tooth and claw 2006 1879 Earth Scotland Torchwood house; Near Balmoral yes
        10 Tennant 203 school Reunion 2007 2007 Earth England Deffry Vale yes
        10 Tennant 204 the Girl in the Fireplace 1727 1744 Earth France Paris yes
        3 Pertwee 51 Spearhead from Space 1970 1990 Earth England London and other no
        3 Pertwee 55 Terror of the Autons 1971 1971 Earth England Luigi Rossini's Circus no
        3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus no
        3 Pertwee 59 The Daemons 1971 1971 Earth England Devil's End; Wiltshire no
        3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs no
  • 41. Schema evolution
      • Use cases
          • New data added to an event stream
          • Need to see historic data with new data (and the schema has changed a lot)
          • Business has changed the field/column name
  • 43. QUESTIONS?
  • 44. THANK YOU
      • Yes, we’re hiring! info@svds.com
      • Stephen O’Sullivan, stephen@svds.com, @steveos
      • Demo code is here: github.com/silicon-valley-data-science/stampedecon-2015
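
The Avro example on slides 37 to 40 renames two columns through aliases, adds an `hd` column with a default, and shows the merged data without the `estimated` column. A minimal sketch of that reader-side behaviour follows. It is a simplified model in plain Python, not the Avro library and not the deck's demo code: `READER_FIELDS` and `resolve_record` are hypothetical names, only a subset of the columns is listed, and dropping `estimated` is an assumption taken from the slide 40 table.

```python
# Simplified, illustrative model of reader-side schema resolution.
# This is NOT the Avro library; all names here are hypothetical.

# Reader ("new") fields: current name, aliases for old names, optional default.
READER_FIELDS = [
    {"name": "drwho_season", "aliases": ["doctor_who_season"]},  # renamed
    {"name": "drwho_actor", "aliases": ["doctor_actor"]},        # renamed
    {"name": "episode_no"},
    {"name": "episode_title"},
    {"name": "hd", "default": "no"},                             # added
]


def resolve_record(old_record, reader_fields):
    """Project a record written with the old schema onto the reader fields."""
    resolved = {}
    for field in reader_fields:
        # Try the current name first, then any alias (a renamed column).
        for candidate in [field["name"], *field.get("aliases", [])]:
            if candidate in old_record:
                resolved[field["name"]] = old_record[candidate]
                break
        else:
            # Column absent from the old data: fall back to the default.
            if "default" not in field:
                raise KeyError(f"no value or default for {field['name']!r}")
            resolved[field["name"]] = field["default"]
    # Old fields that the reader does not list (here: "estimated") are dropped.
    return resolved


old_row = {
    "doctor_who_season": "3",
    "doctor_actor": "Pertwee",
    "episode_no": "51",
    "episode_title": "Spearhead from Space",
    "estimated": "y",
}
new_row = resolve_record(old_row, READER_FIELDS)
print(new_row)
```

With a real Avro reader the same effect would come from supplying the new schema as the reader schema when opening files written with the old one; the sketch only mirrors the alias, default, and drop rules on plain dictionaries.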

Editor's notes

  1. Description: You have your Hadoop cluster, and you are ready to fill it up with data, but wait: which format should you use to store your data? Should you store it in plain text, Sequence File, Avro, or Parquet? (And should you compress it?) This talk will take a closer look at some of the trade-offs, and will cover the how, why, and when of choosing one format over another.
  2. Text does not support block compression: once the files are compressed they are no longer splittable, which increases the cost of reading them.
  3. Each data file contains the values for a set of rows. Within a data file, the values from each column are organized so that they are adjacent, enabling good compression.
  4. No results for query 1 (a count with no conditions). This is because Stinger has metadata about the amount of data in the table (only when it’s an internal table).
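
Note 3 and the Parquet and ORC slides rest on the difference between row and column layout. The toy sketch below illustrates it under loud assumptions: it uses plain Python lists rather than the Parquet or ORC file formats, the column names `season` and `actor` are shortened for illustration, and "values touched" is a stand-in for disk I/O.

```python
# Toy illustration of row vs. column layout; not the Parquet or ORC format.
rows = [
    ("3", "Pertwee", "51", "Spearhead from Space"),
    ("3", "Pertwee", "55", "Terror of the Autons"),
    ("3", "Pertwee", "58", "Colony in Space"),
]
column_names = ["season", "actor", "episode_no", "episode_title"]

# Columnar layout: one contiguous list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def titles_from_rows(rows):
    """Row layout: every field of every row is visited to get one column."""
    touched = 0
    titles = []
    for row in rows:
        touched += len(row)  # the whole record is read
        titles.append(row[3])
    return titles, touched


def titles_from_columns(columns):
    """Column layout: only the requested column is visited."""
    titles = columns["episode_title"]
    return list(titles), len(titles)


row_titles, row_touched = titles_from_rows(rows)
col_titles, col_touched = titles_from_columns(columns)
print(row_touched, col_touched)  # 12 3
```

In the columnar layout the `season` and `actor` lists hold runs of identical adjacent values, which is the property note 3 credits for good compression.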