In the engineering world, we don’t always have the luxury of owning our data pipelines end to end. If only we could influence those outside components… Well, we tried, and this is our story - replete with failure, discovery, and the serenity of enlightenment. Join us on our journey as we learned more than we ever wanted to know about compression in different Apache projects, deployed our own ingestion pipeline in Apache Flume, and ultimately unified these in a robust framework built on Apache Apex that handles 1 TB of data per day. We end with some reflections on the joys and tribulations of the open source realm and some key lessons for other large applications built atop multiple Apache solutions.
2. • The simple becomes complex
• What we expected to work, didn’t
• But one can always find the path
3. It looked so easy…
• Two data streams:
• ~ 25 GB / Day (Gzip)
• ~ 200 GB / Day (Gzip)
• Delivered via rsync from our partner team’s system (Vault-8)
4. Ingest → Parse → Aggregate & Model → Store
• Wanted technology that facilitated exploration and iteration
• Planned for streaming in the long term
5. So, we had some data
• Surprise!
• Individual files roll over by time, rather than size
Dataset #1 -- 10 MB per file
Dataset #2 -- 2-10 GB per file
• GZIP is not a splittable format - can’t be ingested in parallel
• Single core must decompress all data blocks serially
• 1-2 hours / day to parse
6. Hadoop Compression

Codec    Splittable?   Compression Efficiency   Decompression Speed
Gzip     No            Medium - High             Slow
Snappy   No            Low                       Fast
Bzip2    Yes           High                      Slow
LZO      No            Medium                    Fast
Lz4      Yes           Low                       Fast

• Needed: splittable, fast decompression, tool-chain compatibility
• Note: your mileage may vary
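For context, a minimal sketch (assuming hadoop-common on the classpath; the input path is illustrative) of how Hadoop itself classifies a file’s codec - only codecs implementing SplittableCompressionCodec can be split across mappers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        // e.g. /data/feed/part-0001.gz -- the codec is resolved by file extension
        Path input = new Path(args[0]);
        CompressionCodec codec =
                new CompressionCodecFactory(new Configuration()).getCodec(input);
        if (codec == null) {
            System.out.println(input + ": uncompressed, splits freely");
        } else if (codec instanceof SplittableCompressionCodec) {
            System.out.println(input + ": " + codec.getClass().getSimpleName() + " is splittable");
        } else {
            System.out.println(input + ": " + codec.getClass().getSimpleName()
                    + " is NOT splittable - one core decompresses the whole file");
        }
    }
}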
8. Hey, you all should try this!
• Lz4 is compatible with our tooling
• Fast decompression time
• 60x speed-up over Gzip
• Can use the Lz4 CLI to do the compression in NiFi
10. At least WE have good data now, right?
• All data files read as empty in any tool reading from Hadoop
• Surprise!
• There’s Lz4, and then there’s Lz4 – frame compression vs. streaming compression
• Hadoop cannot read Lz4 data compressed via the CLI
• https://issues.apache.org/jira/browse/HADOOP-12990
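A minimal sketch of the mismatch (assuming lz4-java and hadoop-common on the classpath; older Hadoop builds also need the native libhadoop Lz4 bindings): the same payload written as an Lz4 frame (what the CLI produces) and through Hadoop’s Lz4Codec yields two different, mutually unreadable containers.

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import net.jpountz.lz4.LZ4FrameOutputStream;           // lz4-java: the CLI's frame format
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.Lz4Codec;         // Hadoop's own block-stream format

public class Lz4FormatsDiffer {
    public static void main(String[] args) throws Exception {
        byte[] payload = "same bytes, two incompatible containers\n".getBytes(StandardCharsets.UTF_8);

        // Frame format: begins with the Lz4 frame magic number (04 22 4D 18).
        ByteArrayOutputStream frame = new ByteArrayOutputStream();
        try (OutputStream out = new LZ4FrameOutputStream(frame)) {
            out.write(payload);
        }

        // Hadoop's format: no frame header, just length-prefixed compressed blocks.
        Lz4Codec codec = new Lz4Codec();
        codec.setConf(new Configuration());
        ByteArrayOutputStream hadoop = new ByteArrayOutputStream();
        try (OutputStream out = codec.createOutputStream(hadoop)) {
            out.write(payload);
        }

        System.out.println("lz4 CLI / frame header: " + hex(frame.toByteArray(), 4));
        System.out.println("Hadoop Lz4Codec header: " + hex(hadoop.toByteArray(), 4));
    }

    static String hex(byte[] bytes, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n && i < bytes.length; i++) {
            sb.append(String.format("%02x ", bytes[i]));
        }
        return sb.toString().trim();
    }
}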
12. Solution 1 – Patch Hadoop
• But wait!
• No streaming Lz4 support (would need to add it from scratch)
• Breaks backwards compatibility
• Need new parser
• New Lz4 format for Hadoop
• Need to update native Lz4 libraries in Hadoop
• This is a big patch!
13. Solution 2 – Patch NiFi
• Use existing Hadoop Lz4 classes
• Nope.
• No Java Lz4 implementation - Hadoop dynamically loads native C code
• Adds Hadoop dependency
• Must compile, build, and dynamically load native code
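The dependency issue in one line: on the Hadoop versions we used, Lz4 support lives behind JNI, so anything reusing those classes has to ship and load native libhadoop. A quick probe (hadoop-common assumed on the classpath):

import org.apache.hadoop.util.NativeCodeLoader;

public class NativeLibProbe {
    public static void main(String[] args) {
        // Hadoop's Lz4Codec delegates to native code; without libhadoop this stays false
        // and any attempt to compress or decompress Lz4 fails at runtime.
        System.out.println("native libhadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
    }
}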
14. Solution 3 – Use an OSS Lz4 Library!
• Nothing that can generate data Hadoop can read
• Hadoop’s Lz4 format is no longer documented / supported
• Building it ourselves would mean reverse-engineering Hadoop’s Lz4 format
• https://github.com/lz4/lz4
• https://github.com/lz4/lz4-java
• https://github.com/carlomedas/4mc
16. If you can’t beat ‘em, join ‘em!
• Data sent via TCP Stream to cluster endpoint
• Want:
• Durable
• Compressed data stream direct to HDFS
• Files roll over by SIZE instead of DURATION
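A minimal, hypothetical sketch of the rollover behavior we wanted (SizeBasedRoller and openNextFile are illustrative names, not any framework’s API): rotate the current file once a byte threshold is crossed rather than on a timer.

import java.io.IOException;
import java.io.OutputStream;

public class SizeBasedRoller {
    private final long maxBytes;    // e.g. 128 MB, to line up with the HDFS block size
    private long written;
    private OutputStream current;

    public SizeBasedRoller(long maxBytes, OutputStream first) {
        this.maxBytes = maxBytes;
        this.current = first;
    }

    /** Write one record; rotate first if it would push the file past the size limit. */
    public void write(byte[] record) throws IOException {
        if (written + record.length > maxBytes) {
            current.close();
            current = openNextFile();    // next part file in the output directory
            written = 0;
        }
        current.write(record);
        written += record.length;
    }

    // Placeholder: in a real pipeline this would open a new compressed stream on HDFS.
    protected OutputStream openNextFile() throws IOException {
        throw new UnsupportedOperationException("supply an HDFS/file factory here");
    }
}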
18. • Build ingest pipeline in Apex
• Too many unknowns with Flume; Apex:
• Easy to debug
• Has auto-scaling that Flume lacks
• Has Hadoop support we need
• Also looked at Akka Streams as a simpler solution
19. Bonus!
• Raw data is huge: 600 MB/min, 900 GB/day
• We don’t use it all!
• Already updating our batch system to avoid re-compute on old data
• Stream it!
• If the ingest piece is in Apex, why not do filtering and parsing there too? (see the sketch below)
• Unified system: easy to manage, dramatically reduces data load, and lets us handle events in real time
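As a rough illustration of the idea (ParsedEvent, isWanted, and parse are hypothetical placeholders, not our production code), a filter-and-parse step in Apex is just another operator in the same DAG as the ingest piece:

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

public class FilterAndParseOperator extends BaseOperator {

    public final transient DefaultOutputPort<ParsedEvent> parsed = new DefaultOutputPort<>();

    public final transient DefaultInputPort<String> rawLines = new DefaultInputPort<String>() {
        @Override
        public void process(String line) {
            // Drop records we never use downstream; only survivors get parsed and emitted.
            if (isWanted(line)) {
                parsed.emit(parse(line));
            }
        }
    };

    boolean isWanted(String line) {
        return !line.isEmpty() && !line.startsWith("#");   // placeholder filter rule
    }

    ParsedEvent parse(String line) {
        return new ParsedEvent(line);                       // placeholder parser
    }

    /** Placeholder record type for a parsed event. */
    public static class ParsedEvent {
        public final String body;
        public ParsedEvent(String body) { this.body = body; }
    }
}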
21. Just Kidding
• We still see TCP resets
• Apex only supports outputting to Gzip and Bzip (we don’t like those)
• Rollover of compressed files doesn’t respect size limit
22. TCP Resets
• Thought this was a software issue – less likely now
• Able to unit test Apex components to verify our app is working
• Isolated the issue to antiquated hardware (a 10 Mb/sec network interface)
• Quick deployment of Apex provided additional data
23. Compressed Data Output
• Snappy instead of Lz4 (Hadoop streaming Snappy codec)
• Careful! Hadoop has its own version of Snappy too!
• Extending Apex to add Snappy was trivial (patch coming soon)
• Demonstrated auto-scaling and load balancing of output feeds
• Working on isolating roll-over issue
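For reference, a minimal sketch of writing through the Hadoop streaming Snappy codec mentioned above (hadoop-common on the classpath; older Hadoop builds also need native snappy; the output path is illustrative):

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SnappyHdfsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SnappyCodec codec = new SnappyCodec();
        codec.setConf(conf);

        // The .snappy extension lets downstream Hadoop tools pick the codec back up by name.
        Path out = new Path("/data/parsed/events-000001" + codec.getDefaultExtension());
        try (CompressionOutputStream cos = codec.createOutputStream(fs.create(out))) {
            cos.write("one parsed event per line\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}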
24. Lessons Learned
• Don’t change your system without talking to your customers
• Test end to end (including applications) before big changes
• Own your pipelines
• Have a backup plan
• Use extensible and debuggable tools
25. Reflections on Open Source
• Just because the code is there doesn’t mean it does what you want
• Patching OSS AND getting it merged is not always easy
• Not everything plays nicely together, even the popular tools
• Pluggable solutions for data engineering problems still really exist
• Turns out another team (Team #3) was ALSO using this data
• No notification / change-management process
• Team #3’s ingest broke; they made a hard cut to another solution

Plan A – Brute force
• Get from HDFS, decompress on the CLI, write the fixed files back to HDFS
• Not trivial due to cluster space limitations
• Adds an additional step to the pipeline, and it’s slow!

Plan B – Get our own ingest pipeline
• Stream #1 (25 GB / day, compressed): works!
• Stream #2 (200 GB / day, compressed): seems fine for us, but the upstream system sees constant TCP resets
• Eventually breaks the upstream syslog provider
• No way to debug Flume
• Configuration changes don’t help; too many unknowns