Compression, Streaming, and
Data Pipelines, Oh My!
(The Externalities of Data Engineering)
Ilya Ganelin
• The simple becomes complex
• What we expected to work, didn’t
• But one can always find the path
It looked so easy…
• Two data streams:
• ~ 25 GB / Day (Gzip)
• ~ 200 GB / Day (Gzip)
R-Sync
(Vault-8)(Our Partner Team)
Ingest -> Parse -> Aggregate & Model -> Store
• Wanted technology that facilitated exploration and iteration
• Planned for streaming in long term
So, we had some data
• Surprise!
• Individual files roll over by time, rather than size
Dataset #1 -- 10 MB per file
Dataset #2 -- 2-10 GB per file
• Gzip is not a splittable format, so a file can’t be ingested in parallel
• Single core must decompress all data blocks serially
• 1-2 hours / day to parse
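To see why a Gzip file pins a single core, here is a minimal sketch using only the Python standard library. The log lines are synthetic and the helper name `try_decompress_from` is made up for illustration; it is not part of the pipeline described in this talk. It shows that decoding works from byte 0 but not from an arbitrary offset, which is what a second parallel reader would need.

```python
import gzip
import zlib

# Synthetic log lines (made-up data), compressed as one gzip member.
raw = b"".join(b"event %06d payload\n" % i for i in range(50_000))
blob = gzip.compress(raw)


def try_decompress_from(data: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this offset, or the stream state is missing.
        return False


from_start = try_decompress_from(blob, 0)
from_middle = try_decompress_from(blob, len(blob) // 2)
print("from start:", from_start, "| from middle:", from_middle)
```

A reader that is handed the second half of the file has no header and no decoder state, so the whole file has to go through one task from the beginning.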
Hadoop Compression
Codec | Splittable? | Compression Efficiency | Decompression Speed
Gzip | No | Medium - High | Slow
Snappy | No | Low | Fast
Bzip2 | Yes | High | Slow
LZO | No | Medium | Fast
Lz4 | Yes | Low | Fast
• Needed:
• Splittable, fast decompression, tool-chain compatibility
• Note: your mileage may vary
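The selection step can be written down as a small filter. This is only a sketch: the ratings are copied from the table above as the deck gives them (they are not measured here), and the `Codec` class and `candidates` function are hypothetical names. Tool-chain compatibility is not modelled, since it depends on the cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    name: str
    splittable: bool
    compression: str    # rating as given on the slide
    decompression: str  # "Fast" or "Slow", as given on the slide


# Ratings copied from the deck's table; not benchmarked here.
CODECS = [
    Codec("Gzip", False, "Medium - High", "Slow"),
    Codec("Snappy", False, "Low", "Fast"),
    Codec("Bzip2", True, "High", "Slow"),
    Codec("LZO", False, "Medium", "Fast"),
    Codec("Lz4", True, "Low", "Fast"),
]


def candidates(codecs):
    """Keep codecs that are both splittable and fast to decompress."""
    return [c.name for c in codecs if c.splittable and c.decompression == "Fast"]


print(candidates(CODECS))
```

With the deck's ratings, only Lz4 passes both filters, which matches the choice made on the next slide.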
Hey, you all should try this!
• Lz4 is compatible with our tooling
• Fast decompression time
• 60x performance speed-up over Gzip
• Can use Lz4 CLI to compress in NiFi
At least WE have good data now, right?
• All data-files read as empty in any tools reading from Hadoop
• Surprise!
• There’s Lz4, and there’s Lz4 – Frame compression vs. Streaming Compression
• Hadoop cannot read Lz4 compressed via CLI
• https://issues.apache.org/jira/browse/HADOOP-12990
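A quick way to tell which kind of Lz4 file you are holding is to look at the first four bytes. The LZ4 frame format starts with the magic number 0x184D2204, stored little-endian; the Lz4 CLI writes this format by default. The sketch below assumes only that fact. The helper name `looks_like_lz4_frame` is made up, and the check says nothing about what layout Hadoop’s codec expects, only that the file is frame-formatted.

```python
import struct

# Magic number defined by the LZ4 frame format specification.
LZ4_FRAME_MAGIC = 0x184D2204


def looks_like_lz4_frame(header: bytes) -> bool:
    """Return True if `header` begins with the LZ4 frame magic number."""
    if len(header) < 4:
        return False
    (magic,) = struct.unpack("<I", header[:4])  # little-endian uint32
    return magic == LZ4_FRAME_MAGIC


# Example with hand-written bytes (not real file contents).
cli_style = bytes([0x04, 0x22, 0x4D, 0x18, 0x64, 0x40])
other = bytes([0x00, 0x00, 0x10, 0x00, 0x00, 0x00])
print(looks_like_lz4_frame(cli_style), looks_like_lz4_frame(other))
```

Running a check like this on the first bytes of a file in HDFS, before pointing downstream tools at it, would have flagged the mismatch earlier.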
Ok, let’s fix this!
Solution 1 – Patch Hadoop
• But wait!
• No streaming Lz4 support (would need to add it from scratch)
• Breaks backwards compatibility
• Need new parser
• Need new Lz4 format for Hadoop
• Need to update native Lz4 libraries in Hadoop
• This is a big patch!
Solution 2 – Patch NiFi
• Use existing Hadoop Lz4 classes
• Nope.
• No Java Lz4 implementation, Hadoop dynamically loads native C
• Adds Hadoop dependency
• Must compile, build, and dynamically load native code
Solution 3 – Use an OSS Lz4 Library!
• Nothing that can generate data Hadoop can read
• Hadoop’s Lz4 format is no longer documented / supported
• To build it ourselves would need to reverse-engineer Hadoop’s Lz4
• https://github.com/lz4/lz4
• https://github.com/lz4/lz4-java
• https://github.com/carlomedas/4mc
Solution 4 - Brute Force
If you can’t beat ‘em, join ‘em!
• Data sent via TCP Stream to cluster endpoint
• Want:
• Durable
• Compressed data stream direct to HDFS
• Files roll over by SIZE instead of DURATION
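Rolling by size instead of duration can be sketched in a few lines. This is an illustration only, writing to local files rather than HDFS; `SizeRollingWriter`, its parameters, and the file naming are all hypothetical and are not the Apex operator used later in the talk.

```python
import os
import tempfile


class SizeRollingWriter:
    """Write records to numbered files, starting a new file before the
    current one would exceed `max_bytes` (hypothetical helper)."""

    def __init__(self, directory: str, max_bytes: int, prefix: str = "part"):
        self.directory = directory
        self.max_bytes = max_bytes
        self.prefix = prefix
        self.index = 0
        self.written = 0
        self._fh = None

    def _open_next(self) -> None:
        if self._fh is not None:
            self._fh.close()
            self.index += 1
        name = f"{self.prefix}-{self.index:05d}.log"
        self._fh = open(os.path.join(self.directory, name), "wb")
        self.written = 0

    def write(self, record: bytes) -> None:
        # Roll first if this record would push the file past the limit.
        too_big = self.written and self.written + len(record) > self.max_bytes
        if self._fh is None or too_big:
            self._open_next()
        self._fh.write(record)
        self.written += len(record)

    def close(self) -> None:
        if self._fh is not None:
            self._fh.close()
            self._fh = None


def demo_sizes() -> list:
    """Write 100 records of 100 bytes with a 1024-byte limit; return file sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        writer = SizeRollingWriter(tmp, max_bytes=1024)
        for _ in range(100):
            writer.write(b"x" * 100)
        writer.close()
        return sorted(os.path.getsize(os.path.join(tmp, n)) for n in os.listdir(tmp))


print(demo_sizes())
```

Note that the counter here tracks bytes handed to the writer. If a compressor sits between the writer and the file, the bytes on disk will differ from this count, so the limit has to be defined against one or the other.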
• Build ingest pipeline in Apex
• Too many unknowns with Flume; Apex:
• Easy to debug
• Has auto-scaling that Flume lacks
• Has Hadoop support we need
• Also looked at Akka streams for simple solution
Bonus!
• Raw data is huge: 600 MB/min, 900 GB/day
• We don’t use it all!
• Already updating our batch system to avoid re-compute on old data
• Stream it!
• If ingest piece in Apex, why not filtering and parsing?
• Unified system: easy to manage, dramatically reduces data load, and
lets us handle events in real-time
Just Kidding
• We still see TCP resets
• Apex only supports outputting to Gzip and Bzip2 (we don’t like those)
• Rollover of compressed files doesn’t respect size limit
TCP Resets
• Thought this was a software issue – less likely now
• Able to unit test Apex components to verify our app is working
• Isolated issue to antiquated hardware (10 Mb/sec network interface)
• Quick deployment of Apex provided additional data
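A back-of-the-envelope check makes the hardware finding plausible. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mb = 10^6 bits), an evenly spread daily volume, no protocol overhead, and that the daily volumes quoted earlier (25 GB and 200 GB, Gzip-compressed) are what crosses the link. None of those assumptions is confirmed by the deck, and the function names are made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_mbit_per_s(gb_per_day: float) -> float:
    """Average rate in megabits/second needed to move `gb_per_day`."""
    return gb_per_day * 1e9 * 8 / SECONDS_PER_DAY / 1e6


def capacity_gb_per_day(link_mbit_per_s: float) -> float:
    """Most a link of the given speed can carry in one day, in GB."""
    return link_mbit_per_s * 1e6 / 8 * SECONDS_PER_DAY / 1e9


for name, volume in [("stream 1", 25), ("stream 2", 200)]:
    print(name, "needs about", round(required_mbit_per_s(volume), 1), "Mb/s")
print("a 10 Mb/s link carries at most", capacity_gb_per_day(10), "GB/day")
```

Under these assumptions the smaller stream fits comfortably on a 10 Mb/sec interface and the larger one needs nearly twice what the interface can deliver, which is consistent with only the larger stream showing resets.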
Compressed Data Output
• Snappy instead of Lz4 (Hadoop streaming Snappy codec)
• Careful! Hadoop has its own version of Snappy too!
• Extending Apex to add Snappy was trivial (patch coming soon)
• Demonstrated auto-scaling and load balancing of output feeds
• Working on isolating roll-over issue
Lessons Learned
• Don’t change your system without talking to your customers
• Test end to end (including applications) before big changes
• Own your pipelines
• Have a backup plan
• Use extensible and debuggable tools
Reflections on Open Source
• Just because the code is there, it doesn’t mean it does what you want
• Patching OSS AND getting it merged is not always easy
• Not everything plays nicely together, even the popular tools
• Pluggable solutions for data engineering problems still really exist
References
• https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
• http://stackoverflow.com/questions/37614410/comparison-between-lz4-vs-lz4-hc-vs-blosc-vs-snappy-vs-fastlz
• http://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable
• https://github.com/lz4/lz4
• https://issues.apache.org/jira/browse/HADOOP-12990
• https://issues.apache.org/jira/browse/NIFI-3420

Editor's notes

  1. Turns out another team (Team #3) was ALSO using this data.
     • No notification / change management process
     • Team #3’s ingest broke; they made a hard cut to another solution
  2. Get from HDFS -> Decompress on CLI -> Write fixed files back to HDFS
     • Not trivial due to cluster space limitations
     • Adds an additional step to the pipeline, and it’s slow!
     • Plan A: brute force
     • Plan B: get our own ingest pipeline
  3. Stream #1 (25 GB / day, compressed): works!
     Stream #2 (200 GB / day, compressed): seems fine for us, but the upstream system sees constant TCP resets
     • Eventually breaks the upstream syslog provider
     • No way to debug Flume
     • Configuration changes don’t help; too many unknowns