March 2012 HUG: JuteRC compiler

Yahoo's data ETL pipeline continuously processes tens of terabytes of data every day. Finding a storage methodology that can store and fetch this data efficiently has always been a challenge for the pipeline. A recent study inside Yahoo showed a dramatic reduction in data size when switching from the Sequence File format to the RC File format, so we decided to convert our data to RC File. The most challenging task is serializing the data objects manually. We rely on Jute, the Hadoop Record Compiler, to provide serialization code. However, Jute does not support the RC File format, and the RC File format does not support native Hadoop Writable objects, so writing serialization code by hand becomes complicated and repetitive. We therefore created the JuteRC compiler, an extension to the Hadoop Record Compiler (Jute). It generates serialization/deserialization code for any user-defined primitive or composite data type. MapReduce programmers can plug the generated code directly into their jobs to produce output files in the RC File storage format. With the help of the JuteRC compiler, our experiment against Yahoo audience data showed a 26-28% file size reduction and a 40% read/write performance improvement compared to Sequence File. We are currently in the process of open-sourcing JuteRC.

Hadoop Jute RC Compiler

Tanping Wang

Yahoo! User Data Analytics

Agenda

• How we use the Jute compiler today
• The JuteRC compiler
• What we have achieved by using JuteRC

Yahoo! Presentation Template, Confidential 2 03/30/12

Hadoop Record Compiler (Jute)

• Generates serialization code to store data in
the Sequence File format
org.apache.hadoop.record

How Do We Use Jute Today

• Use the Data Definition Language to define my
data type:
    class MyDataType {
        buffer myBuffer;
        long myLong;
    }
• Use the Jute compiler to generate serialization
code:
    $ rcc --language java mydatatype.jr

Jute Generates Serialization Code for Me

    public void serialize(final org.apache.hadoop.record.RecordOutput _rio_a,
                          final String _rio_tag) throws java.io.IOException {
        _rio_a.startRecord(this,_rio_tag);
        _rio_a.writeBuffer(myBuffer,"myBuffer");
        _rio_a.writeLong(myLong,"myLong");
        _rio_a.endRecord(this,_rio_tag);
    }

    private void deserializeWithoutFilter(final org.apache.hadoop.record.RecordInput _rio_a,
                                          final String _rio_tag) throws java.io.IOException {
        _rio_a.startRecord(_rio_tag);
        myBuffer=_rio_a.readBuffer("myBuffer");
        myLong=_rio_a.readLong("myLong");
        _rio_a.endRecord(_rio_tag);
    }
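The generated methods above write a record's fields one after another and read them back in the same order. The sketch below imitates that row-oriented pattern using only java.io, so it runs without Hadoop. The class name RowRecordSketch and the length-prefixed buffer encoding are assumptions made for illustration; this is not the encoding used by Hadoop's RecordOutput.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Jute-style record; not the class rcc generates.
final class RowRecordSketch {
    final byte[] myBuffer;
    final long myLong;

    RowRecordSketch(byte[] myBuffer, long myLong) {
        this.myBuffer = myBuffer;
        this.myLong = myLong;
    }

    // Writes the fields back to back, in declaration order, into one byte stream.
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(myBuffer.length); // assumed length prefix for the buffer
            out.write(myBuffer);
            out.writeLong(myLong);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields in exactly the order serialize() wrote them.
    static RowRecordSketch deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[in.readInt()];
            in.readFully(buffer);
            long value = in.readLong();
            return new RowRecordSketch(buffer, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because every record is one contiguous run of bytes, values of different types end up interleaved on disk, which is the property the RC format changes.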

Today, the Yahoo audience ETL pipeline processes
tens of terabytes of data per day.
We rely on Jute and use Sequence File to
store our data.

However, We Want To Use RC Format.

RC File Format
• RCFile has much in common with Sequence File, but it splits a file
into row groups. Inside each row group, the data is stored column by column.

• Values of the same column, and therefore of the same data type, are
grouped together. This potentially yields a better compression ratio.
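To make the layout concrete, here is a minimal sketch that splits a list of rows into row groups and transposes each group so that values of the same column sit next to each other. The names RowGroupSketch and toRowGroups, and the use of strings as cell values, are assumptions for illustration. The real RCFile format also stores metadata and compresses columns, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the row-group idea; not Hive's RCFile code.
final class RowGroupSketch {
    // Returns row groups; each group is a list of columns; each column is a list of values.
    // Assumes every row has the same number of fields.
    static List<List<List<String>>> toRowGroups(List<List<String>> rows, int rowsPerGroup) {
        if (rowsPerGroup <= 0) {
            throw new IllegalArgumentException("rowsPerGroup must be positive");
        }
        List<List<List<String>>> groups = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += rowsPerGroup) {
            int end = Math.min(start + rowsPerGroup, rows.size());
            int columnCount = rows.get(start).size();
            List<List<String>> columns = new ArrayList<>();
            for (int c = 0; c < columnCount; c++) {
                // Collect field c of every row in this group into one column.
                List<String> column = new ArrayList<>();
                for (int r = start; r < end; r++) {
                    column.add(rows.get(r).get(c));
                }
                columns.add(column);
            }
            groups.add(columns);
        }
        return groups;
    }
}
```

With three two-field rows and a group size of two, the first group holds two columns of two values each and the second group holds two columns of one value each.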

Jute only supports the Sequence File format.
So we built the JuteRC compiler.

Data Type

Also…

• For each JType, override genReadMethod
and genWriteMethod.
• Changed the CodeGenerator in Jute.
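One way to picture the per-type overrides is a small emitter in which each type knows how to print its own read and write statement, and a generator walks the fields in declaration order. Everything below is a simplified, hypothetical model: the class names carry a Sketch suffix, the method signatures are reduced to a single field name and are not Jute's actual signatures, and the emitted strings simply mirror the RcUtil calls that appear in the generated code on the later slides.

```java
import java.util.Map;

// Hypothetical, simplified model of a per-type code emitter.
abstract class JTypeSketch {
    abstract String genWriteMethod(String fieldName);
    abstract String genReadMethod(String fieldName);
}

final class JLongSketch extends JTypeSketch {
    @Override
    String genWriteMethod(String fieldName) {
        return "com.yahoo.ccdi.fetl.RcUtil.writeLong(this, " + fieldName + ", writeIndx++);";
    }

    @Override
    String genReadMethod(String fieldName) {
        return fieldName + "=com.yahoo.ccdi.fetl.RcUtil.readLong(bra, readIndx++);";
    }
}

final class JBufferSketch extends JTypeSketch {
    @Override
    String genWriteMethod(String fieldName) {
        return "com.yahoo.ccdi.fetl.RcUtil.writeBuffer(this, " + fieldName + ", writeIndx++);";
    }

    @Override
    String genReadMethod(String fieldName) {
        return fieldName + "=com.yahoo.ccdi.fetl.RcUtil.readBuffer(bra, readIndx++);";
    }
}

final class CodeGeneratorSketch {
    // Emits one write statement per field; pass an ordered map (e.g. LinkedHashMap)
    // so the statements follow the declaration order of the fields.
    static String genSerializeBody(Map<String, JTypeSketch> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, JTypeSketch> field : fields.entrySet()) {
            body.append(field.getValue().genWriteMethod(field.getKey())).append('\n');
        }
        return body.toString();
    }

    // Emits one read statement per field, in the same order as the writes.
    static String genDeserializeBody(Map<String, JTypeSketch> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, JTypeSketch> field : fields.entrySet()) {
            body.append(field.getValue().genReadMethod(field.getKey())).append('\n');
        }
        return body.toString();
    }
}
```

Keeping the statement text inside each type means supporting a new field type only requires a new subclass; the generator loop stays unchanged.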

Serialization Code Generated by Jute vs. JuteRC

Jute:
    public void serialize(final org.apache.hadoop.record.RecordOutput _rio_a,
                          final String _rio_tag) throws java.io.IOException {
        _rio_a.startRecord(this,_rio_tag);
        _rio_a.writeBuffer(myBuffer,"myBuffer");
        _rio_a.writeLong(myLong,"myLong");
        _rio_a.endRecord(this,_rio_tag);
    }

JuteRC:
    public class MyDataType extends org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable {
        public void serialize() {
            int writeIndx = 0;
            try {
                com.yahoo.ccdi.fetl.RcUtil.writeBuffer(this, myBuffer, writeIndx++);
                com.yahoo.ccdi.fetl.RcUtil.writeLong(this, myLong, writeIndx++);
            } catch(java.io.IOException e) { }
        }

Deserialization Code

Jute:
    private void deserializeWithoutFilter(final org.apache.hadoop.record.RecordInput _rio_a,
                                          final String _rio_tag) throws java.io.IOException {
        _rio_a.startRecord(_rio_tag);
        myBuffer=_rio_a.readBuffer("myBuffer");
        myLong=_rio_a.readLong("myLong");
        _rio_a.endRecord(_rio_tag);
    }

JuteRC:
        public void deserialize(org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable bra) {
            int readIndx = 0;
            try {
                myBuffer=com.yahoo.ccdi.fetl.RcUtil.readBuffer(bra, readIndx++);
                myLong=com.yahoo.ccdi.fetl.RcUtil.readLong(bra, readIndx++);
            } catch(java.io.IOException e) { }
        }
    }
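The JuteRC code above addresses each field by its column index rather than by a tag in a stream. The slides do not show the RcUtil source, so the sketch below is only a guess at the shape of such a helper: RcUtilSketch is a hypothetical name, a plain List<byte[]> stands in for BytesRefArrayWritable, and the fixed 8-byte big-endian encoding of longs is an assumption.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical stand-in for the slides' RcUtil. A List<byte[]> plays the role of
// BytesRefArrayWritable, holding one byte[] cell per column index.
final class RcUtilSketch {
    static void writeLong(List<byte[]> row, long value, int index) {
        // Assumed encoding: 8 bytes, big-endian (ByteBuffer's default order).
        setCell(row, index, ByteBuffer.allocate(Long.BYTES).putLong(value).array());
    }

    static long readLong(List<byte[]> row, int index) {
        return ByteBuffer.wrap(row.get(index)).getLong();
    }

    static void writeBuffer(List<byte[]> row, byte[] value, int index) {
        setCell(row, index, value.clone()); // copy so later caller changes do not leak in
    }

    static byte[] readBuffer(List<byte[]> row, int index) {
        return row.get(index).clone();
    }

    // Grows the row with empty cells if needed, then stores the cell at its column index.
    private static void setCell(List<byte[]> row, int index, byte[] cell) {
        while (row.size() <= index) {
            row.add(new byte[0]);
        }
        row.set(index, cell);
    }
}
```

A caller would follow the same pattern as the generated code: start an index at 0, pass index++ to each write call in field order, then read back with a fresh index in the same order. Since each field lands in its own cell, a columnar writer can place cell i of every row into column i of the row group.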

Using RC
• Converting Sequence File data to RC format
achieved a 26-28% file size reduction.

• Faster I/O: reading/writing takes about 0.6x the
Sequence File time (a 40% improvement).
• We process our data using both Hive and Pig on
top of HCatalog.

Open Source

• We are in the process of open-sourcing JuteRC.
It is under review by the Yahoo! Open Source
Working Group.
• MapReduce programmers can directly plug in
the code generated by JuteRC and store their
data in RC format.

References

• RCFile: A Fast and Space-efficient Data Placement Structure in
MapReduce-based Warehouse Systems
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-

• Hive RCFile:
http://hive.apache.org/docs/r0.7.0/api/org/apache/hadoop/hive/ql/io/RCFile.ht


