Yahoo’s data ETL pipeline continuously processes tens of terabytes of data every day. Finding a storage format that can store and fetch this data efficiently has always been a challenge for the pipeline. A recent study inside Yahoo showed a dramatic reduction in data size after switching from the Sequence File format to the RC File format, so we decided to convert our data to RC File. The most challenging part is serializing the data objects by hand. We rely on Jute, the Hadoop Record Compiler, to generate serialization code, but Jute does not support the RC File format, and the RC File format does not support native Hadoop Writable objects. Writing the serialization code by hand is therefore complicated and repetitive. To address this, we built the JuteRC compiler, an extension to the Hadoop Record Compiler (Jute). It generates serialization/deserialization code for any user-defined primitive or composite data type. MapReduce programmers can plug the generated code directly into their jobs to produce output files in the RC File storage format. With the help of the JuteRC compiler, our experiment on Yahoo audience data showed a 26-28% file size reduction and a 40% read/write performance improvement compared to Sequence File. We are currently in the process of open-sourcing JuteRC.
March 2012 HUG: JuteRC compiler
1. Hadoop Jute RC Compiler
Tanping Wang
Yahoo! User Data Analytics
2. Agenda
• How we use Jute compiler today
• JuteRC compiler
• By using JuteRC, what we have achieved
Yahoo! Presentation Template, Confidential 2 03/30/12
3. Hadoop Record Compiler (Jute)
• Generates serialization code to store data in
the Sequence File format
org.apache.hadoop.record
4. How Do We Use Jute Today
• Use Data Definition Language to define my
data type:
class MyDataType {
buffer myBuffer;
long myLong;
}
• Use Jute compiler to generate serialization
code:
$ rcc --language java mydatatype.jr
5. Jute Generates Serialization Code for me
public void serialize(final org.apache.hadoop.record.RecordOutput _rio_a, final String _rio_tag)
    throws java.io.IOException {
  _rio_a.startRecord(this, _rio_tag);
  _rio_a.writeBuffer(myBuffer, "myBuffer");
  _rio_a.writeLong(myLong, "myLong");
  _rio_a.endRecord(this, _rio_tag);
}

private void deserializeWithoutFilter(final org.apache.hadoop.record.RecordInput _rio_a, final String _rio_tag)
    throws java.io.IOException {
  _rio_a.startRecord(_rio_tag);
  myBuffer = _rio_a.readBuffer("myBuffer");
  myLong = _rio_a.readLong("myLong");
  _rio_a.endRecord(_rio_tag);
}
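As a rough illustration of what the generated methods above do, the following hand-written Java sketch round-trips a record with the same two fields using only java.io. The class name MyDataTypeSketch, the toBytes/fromBytes helpers, and the length-prefixed byte layout are assumptions made for this example. They are not Jute output and do not reproduce the wire format of Hadoop's RecordOutput implementations.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hand-written stand-in for a generated record (hypothetical; NOT produced by Jute). */
class MyDataTypeSketch {
    byte[] myBuffer = new byte[0];
    long myLong;

    /** Writes the fields in declaration order, mirroring the generated serialize(). */
    byte[] toBytes() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(myBuffer.length); // assumed length prefix so the reader knows where the buffer ends
            out.write(myBuffer);
            out.writeLong(myLong);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back in exactly the order they were written. */
    static MyDataTypeSketch fromBytes(byte[] data) {
        MyDataTypeSketch record = new MyDataTypeSketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            record.myBuffer = new byte[in.readInt()];
            in.readFully(record.myBuffer);
            record.myLong = in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }

    public static void main(String[] args) {
        MyDataTypeSketch original = new MyDataTypeSketch();
        original.myBuffer = "hello".getBytes(StandardCharsets.UTF_8);
        original.myLong = 12345L;

        MyDataTypeSketch copy = fromBytes(original.toBytes());
        boolean same = Arrays.equals(original.myBuffer, copy.myBuffer)
                && original.myLong == copy.myLong;
        System.out.println("round trip ok: " + same);
    }
}
```

The point of the sketch is the symmetry: every write call has a matching read call in the same order, which is the repetitive part a record compiler generates from the data definition.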
6. Today the Yahoo audience ETL pipeline processes
tens of terabytes of data per day.
We rely on Jute, and we use Sequence File to
store our data.
7. However, We Want To Use RC Format.
8. RC File Format
• RCFile has much in common with Sequence File, but it splits a file
into row groups. Within each row group, the data is stored column by column.
• Values of the same column, and therefore of similar type, are stored
together. This can yield a better compression ratio.
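The layout described above can be sketched in a few lines of Java. This is a toy model, not the RCFile implementation: the class RowGroupSketch, its Row type, the length-prefixed encoding, and the synthetic data are all assumptions for illustration, and java.util.zip.Deflater stands in for whatever codec a real job would configure. The sketch lays out one row group both row by row and column by column, then compresses each column on its own.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Toy model of one row group (hypothetical; not the Hive RCFile code). */
class RowGroupSketch {

    /** One record with the two fields of the MyDataType example. */
    static final class Row {
        final byte[] myBuffer;
        final long myLong;

        Row(byte[] myBuffer, long myLong) {
            this.myBuffer = myBuffer;
            this.myLong = myLong;
        }
    }

    /** Row-wise layout: the fields of each record are written next to each other. */
    static byte[] rowWise(Row[] rows) {
        int size = 0;
        for (Row r : rows) {
            size += 4 + r.myBuffer.length + 8;
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        for (Row r : rows) {
            out.putInt(r.myBuffer.length).put(r.myBuffer).putLong(r.myLong);
        }
        return out.array();
    }

    /** Column-wise layout: column 0 holds every myBuffer value, column 1 every myLong value. */
    static byte[][] columns(Row[] rows) {
        int bufferBytes = 0;
        for (Row r : rows) {
            bufferBytes += 4 + r.myBuffer.length;
        }
        ByteBuffer bufferColumn = ByteBuffer.allocate(bufferBytes);
        ByteBuffer longColumn = ByteBuffer.allocate(8 * rows.length);
        for (Row r : rows) {
            bufferColumn.putInt(r.myBuffer.length).put(r.myBuffer);
            longColumn.putLong(r.myLong);
        }
        return new byte[][] { bufferColumn.array(), longColumn.array() };
    }

    /** Returns the number of bytes produced by DEFLATE-compressing the input. */
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] chunk = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Synthetic rows: a small set of repeating buffers and slowly increasing longs.
        Row[] rows = new Row[1000];
        for (int i = 0; i < rows.length; i++) {
            byte[] buffer = ("user-" + (i % 10)).getBytes(StandardCharsets.UTF_8);
            rows[i] = new Row(buffer, 1_330_000_000L + i);
        }

        int columnTotal = 0;
        for (byte[] column : columns(rows)) {
            columnTotal += deflatedSize(column); // each column is compressed independently
        }
        System.out.println("row-wise compressed bytes:    " + deflatedSize(rowWise(rows)));
        System.out.println("column-wise compressed bytes: " + columnTotal);
    }
}
```

Both layouts hold exactly the same bytes before compression. Only the ordering differs, so any difference in the printed sizes comes from how well the compressor handles each ordering. The numbers from this synthetic data say nothing about the results reported for Yahoo audience data.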
9. Jute only supports the Sequence File format.
So we built the JuteRC compiler.
14. Using RC
• Converting Sequence File data to RC format
achieved a 26-28% file size reduction.
• Faster I/O: reading and writing take about 0.6x
the time needed with Sequence File.
• We process our data using both Hive and Pig on
top of HCatalog.
15. Open Source
• We are in the process of open-sourcing JuteRC.
It is under review by the Yahoo! Open Source
Working Group.
• MapReduce programmers can directly plug in
the code generated by JuteRC and store their
data in RC format.
16. References
• RCFile: A Fast and Space-efficient Data Placement Structure in
MapReduce-based Warehouse Systems
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-
• Hive RCFile:
http://hive.apache.org/docs/r0.7.0/api/org/apache/hadoop/hive/ql/io/RCFile.ht