SlideShare une entreprise Scribd logo
1  sur  21
Télécharger pour lire hors ligne
Building Bulk Import
Eric Newton
SW Complete, Inc.
Ingest 101
The life of a mutation:
○ Send to server
○ Write to Write-Ahead Log
○ Store in memory
○ Write memory to a file
○ Merge, re-write as needed
At least two writes, small in-memory sort
Bulk Ingest
● Accumulo: heavy ingest
● Use Map-Reduce efficiency
● Pre-sort incoming data
● Hand whole sorted files to accumulo
● One write
● Larger sorts
Version 1
importDirectory(String dir, String failDir)
Client computes the servers that need the file:
■ Analysis of file
■ Moves directory under Accumulo
■ Retry logic (to handle splits, failures)
Version 1: problems
● Limited to the client’s computational power
● Permission: client had to be all-knowing
● Files could be added to servers many times
● Defer file collection while bulk importing
● Clients can fail
Version 1.1
● Clients hand bulk imports to the master
● Fixed permission problems
● Added a bulk import test to the Random
Walk test suit
The Test
Create a sorted file with a lot of 1’s:
12345678 -> 1
Create an identical file, with lots of -1’s:
12345678 -> -1
Add a summation iterator over the table
Verify: every entry should be zero:
12345678 -> 0
Random Walk 101
Randomly:
○ Import files in random order
○ Split the files into random sizes
○ Split tablets and random points
○ Kill tablet servers (agitate)
Under load
At scale
Version 2.0
● Master only coordinates file processing
● Distribute work to tablet servers
● Master distributes files to tablet servers
● Tablet serves
○ Analyze files for assignments
○ Retry
○ Communicate with destination tablet servers
FilesFilesFiles
Command Flow
Client Master
Tablet
Server
Tablet
Server
Tablet
Server
Tablet
Server
Problems
Problems solved:
○ Permissions controlled by Master
○ File processing is distributed
Bulk import tested heavily
Consistency
○ Not so much
Version 3.0
Problem: file imported more than once
○ Repeated reloading
○ RPC timeouts
○ Tablet migration
○ Tablet split
Add flags to metadata table to prevent:
○ file garbage collection
○ repeated imports
Version 3.0
● Problem Solved
○ Reduced Name Node ops
○ Reduced trash laying around from failed imports
○ Imports not repeated
● Does it stand up to the Random Walk Test?
○ Not so much
Death by Slow Thread
“Please take this file”
“OK… looks good, never saw this before”
Sleep
“Hey, Please take this file, again”
“OK… looks good, never saw this before”
“Thanks!”
Compact!
Wakeup! Time to import that file from the 1st
request!
Zookeeper to the Rescue
Add a 3rd
party negotiator
● Define a session
● Add a file only while session is active
● Store session in zookeeper
● Get agreement about the session at each critical point,
including metadata table updates
Session guides clean-up of markers
Session closes only after all agree
Session
“Take this file, for session 123”
“OK, working on session 123”
“Never saw this file before”
Sleep
“Repeat, file for session 123”
“Never saw this file before”
“Anybody working on session 123?”
Session
“Yes, I’m processing session 123!”
“OK, finish up.”
Double check on session 123, import file
“Anybody working on session 123?”
“What’s session 123?”
Remove markers in metadata
Bulk Import
● Problem solved:
○ Distributed processing
○ Permissions
○ Files imported once, and only once
○ Markers are cleaned up in the face of failures
● Performance
○ Not so much
Master as Bottleneck
Bulk import thousands of files
Every 15 minutes
Master renames files and puts them under /accumulo
Bottleneck
One master competes for NN ops with N tablet servers
Master, Go Faster
Add configurable thread pool
Push more move requests to the NN
Compete more fiercely for resources
Bulk Import
More efficient data ingest
Easy: just a file to the right tablets
Hard: consistency in a distributed system
Testing is your friend
Nothing prepares you for large-scale problems!

Contenu connexe

Tendances

Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyOlivier Bourgeois
 
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael MedinOSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael MedinNETWAYS
 
HTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS moduleHTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS moduleKazuho Oku
 
Writing a fast HTTP parser
Writing a fast HTTP parserWriting a fast HTTP parser
Writing a fast HTTP parserfukamachi
 
4 exercises for part 1
4   exercises for part 14   exercises for part 1
4 exercises for part 1drewz lin
 
Gsummit apis-2013
Gsummit apis-2013Gsummit apis-2013
Gsummit apis-2013Gluster.org
 
Lcna tutorial-2012
Lcna tutorial-2012Lcna tutorial-2012
Lcna tutorial-2012Gluster.org
 
Harden Security Devices Against Increasingly Sophisticated Evasions
Harden Security Devices Against Increasingly Sophisticated EvasionsHarden Security Devices Against Increasingly Sophisticated Evasions
Harden Security Devices Against Increasingly Sophisticated EvasionsIxia
 
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017Codemotion
 
AV Evasion with the Veil Framework
AV Evasion with the Veil FrameworkAV Evasion with the Veil Framework
AV Evasion with the Veil FrameworkVeilFramework
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataRoberto Franchini
 
Woo: Writing a fast web server @ ELS2015
Woo: Writing a fast web server @ ELS2015Woo: Writing a fast web server @ ELS2015
Woo: Writing a fast web server @ ELS2015fukamachi
 
Multi-core Node.pdf
Multi-core Node.pdfMulti-core Node.pdf
Multi-core Node.pdfAhmed Hassan
 
LinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingLinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingAndrey Vagin
 
Life as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiLife as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiGluster.org
 

Tendances (20)

Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development Efficiency
 
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael MedinOSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
OSMC 2012 | Distributed Monitoring mit NSClient++ by Michael Medin
 
HTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS moduleHTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS module
 
Writing a fast HTTP parser
Writing a fast HTTP parserWriting a fast HTTP parser
Writing a fast HTTP parser
 
4 exercises for part 1
4   exercises for part 14   exercises for part 1
4 exercises for part 1
 
Veil-Ordnance
Veil-OrdnanceVeil-Ordnance
Veil-Ordnance
 
Barcamp presentation
Barcamp presentationBarcamp presentation
Barcamp presentation
 
Gsummit apis-2013
Gsummit apis-2013Gsummit apis-2013
Gsummit apis-2013
 
Lcna tutorial-2012
Lcna tutorial-2012Lcna tutorial-2012
Lcna tutorial-2012
 
Harden Security Devices Against Increasingly Sophisticated Evasions
Harden Security Devices Against Increasingly Sophisticated EvasionsHarden Security Devices Against Increasingly Sophisticated Evasions
Harden Security Devices Against Increasingly Sophisticated Evasions
 
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
Asynchronous IO in Rust - Enrico Risa - Codemotion Rome 2017
 
AV Evasion with the Veil Framework
AV Evasion with the Veil FrameworkAV Evasion with the Veil Framework
AV Evasion with the Veil Framework
 
SELinux by Example
SELinux by ExampleSELinux by Example
SELinux by Example
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big data
 
Woo: Writing a fast web server @ ELS2015
Woo: Writing a fast web server @ ELS2015Woo: Writing a fast web server @ ELS2015
Woo: Writing a fast web server @ ELS2015
 
Multi-core Node.pdf
Multi-core Node.pdfMulti-core Node.pdf
Multi-core Node.pdf
 
rsyslog meets docker
rsyslog meets dockerrsyslog meets docker
rsyslog meets docker
 
LinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel TestingLinuxCon 2011: OpenVZ and Linux Kernel Testing
LinuxCon 2011: OpenVZ and Linux Kernel Testing
 
Life as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiLife as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan Rossi
 
Linux for Beginners
Linux for  BeginnersLinux for  Beginners
Linux for Beginners
 

En vedette

Accumulo design
Accumulo designAccumulo design
Accumulo designscsorensen
 
Accumulo meetup 20130109
Accumulo meetup 20130109Accumulo meetup 20130109
Accumulo meetup 20130109Sqrrl
 
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...Accumulo Summit
 
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]Accumulo Summit
 
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]Accumulo Summit
 
Accumulo Summit 2014: Monitoring Apache Accumulo
Accumulo Summit 2014: Monitoring Apache AccumuloAccumulo Summit 2014: Monitoring Apache Accumulo
Accumulo Summit 2014: Monitoring Apache AccumuloAccumulo Summit
 
Apache Accumulo and the Data Lake
Apache Accumulo and the Data LakeApache Accumulo and the Data Lake
Apache Accumulo and the Data LakeAaron Cordova
 
Accumulo Summit 2016: Accumulo in the Enterprise
Accumulo Summit 2016: Accumulo in the EnterpriseAccumulo Summit 2016: Accumulo in the Enterprise
Accumulo Summit 2016: Accumulo in the EnterpriseAccumulo Summit
 
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?Accumulo Summit
 
Large Scale Accumulo Clusters
Large Scale Accumulo ClustersLarge Scale Accumulo Clusters
Large Scale Accumulo ClustersAaron Cordova
 
Accumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
Accumulo Summit 2016: Embedding Authenticated Data Structures in AccumuloAccumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
Accumulo Summit 2016: Embedding Authenticated Data Structures in AccumuloAccumulo Summit
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick IntroductionJames Salter
 
Sqrrl real time_big_data_20130411
Sqrrl real time_big_data_20130411Sqrrl real time_big_data_20130411
Sqrrl real time_big_data_20130411Sqrrl
 
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit
 
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit
 
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big DataOct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big DataYahoo Developer Network
 
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit
 
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...Accumulo Summit
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo OverviewBill Havanki
 
Apache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on MesosApache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on MesosJoe Stein
 

En vedette (20)

Accumulo design
Accumulo designAccumulo design
Accumulo design
 
Accumulo meetup 20130109
Accumulo meetup 20130109Accumulo meetup 20130109
Accumulo meetup 20130109
 
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
Accumulo Summit 2014: Four Orders of Magnitude: Running Large Scale Accumulo ...
 
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
Accumulo Summit 2015: Tracing in Accumulo and HDFS [Internals]
 
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
Accumulo Summit 2015: Verifiable Responses to Accumulo Queries [Security]
 
Accumulo Summit 2014: Monitoring Apache Accumulo
Accumulo Summit 2014: Monitoring Apache AccumuloAccumulo Summit 2014: Monitoring Apache Accumulo
Accumulo Summit 2014: Monitoring Apache Accumulo
 
Apache Accumulo and the Data Lake
Apache Accumulo and the Data LakeApache Accumulo and the Data Lake
Apache Accumulo and the Data Lake
 
Accumulo Summit 2016: Accumulo in the Enterprise
Accumulo Summit 2016: Accumulo in the EnterpriseAccumulo Summit 2016: Accumulo in the Enterprise
Accumulo Summit 2016: Accumulo in the Enterprise
 
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
 
Large Scale Accumulo Clusters
Large Scale Accumulo ClustersLarge Scale Accumulo Clusters
Large Scale Accumulo Clusters
 
Accumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
Accumulo Summit 2016: Embedding Authenticated Data Structures in AccumuloAccumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
Accumulo Summit 2016: Embedding Authenticated Data Structures in Accumulo
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick Introduction
 
Sqrrl real time_big_data_20130411
Sqrrl real time_big_data_20130411Sqrrl real time_big_data_20130411
Sqrrl real time_big_data_20130411
 
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
 
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
 
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big DataOct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data
 
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
 
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
Accumulo Summit 2016: You Won't Believe These 3 Tricks for Maximizing Accumul...
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo Overview
 
Apache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on MesosApache Kafka, HDFS, Accumulo and more on Mesos
Apache Kafka, HDFS, Accumulo and more on Mesos
 

Similaire à Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored]

Accumulo14 15
Accumulo14 15Accumulo14 15
Accumulo14 15Sqrrl
 
Lcna 2012-example
Lcna 2012-exampleLcna 2012-example
Lcna 2012-exampleGluster.org
 
Accumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and RoadmapAccumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and RoadmapAaron Cordova
 
Stashaway 1
Stashaway 1Stashaway 1
Stashaway 1priestc
 
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bagBrown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bagCask Data
 
Bacula Overview
Bacula OverviewBacula Overview
Bacula Overviewsambismo
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 
Linux Distribution Automated Testing
 Linux Distribution Automated Testing Linux Distribution Automated Testing
Linux Distribution Automated TestingAleksander Baranowski
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django applicationbangaloredjangousergroup
 
Toolchain Independent Distributed Compilation
Toolchain Independent Distributed CompilationToolchain Independent Distributed Compilation
Toolchain Independent Distributed CompilationDietmar Hauser
 
Source andassetcontrolingamedev
Source andassetcontrolingamedevSource andassetcontrolingamedev
Source andassetcontrolingamedevMatt Benic
 
Introduction to containers
Introduction to containersIntroduction to containers
Introduction to containersNitish Jadia
 
Remote file path traversal attacks for fun and profit
Remote file path traversal attacks for fun and profitRemote file path traversal attacks for fun and profit
Remote file path traversal attacks for fun and profitDharmalingam Ganesan
 
Technical track-afterimaging Progress Database
Technical track-afterimaging Progress DatabaseTechnical track-afterimaging Progress Database
Technical track-afterimaging Progress DatabaseVinh Nguyen
 
Course 102: Lecture 16: Process Management (Part 2)
Course 102: Lecture 16: Process Management (Part 2) Course 102: Lecture 16: Process Management (Part 2)
Course 102: Lecture 16: Process Management (Part 2) Ahmed El-Arabawy
 
Utopia Kingdoms scaling case. From 4 users to 50.000+
Utopia Kingdoms scaling case. From 4 users to 50.000+Utopia Kingdoms scaling case. From 4 users to 50.000+
Utopia Kingdoms scaling case. From 4 users to 50.000+Python Ireland
 
Utopia Kindgoms scaling case: From 4 to 50K users
Utopia Kindgoms scaling case: From 4 to 50K usersUtopia Kindgoms scaling case: From 4 to 50K users
Utopia Kindgoms scaling case: From 4 to 50K usersJaime Buelta
 
Google File System
Google File SystemGoogle File System
Google File SystemDreamJobs1
 

Similaire à Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored] (20)

Accumulo14 15
Accumulo14 15Accumulo14 15
Accumulo14 15
 
Lcna 2012-example
Lcna 2012-exampleLcna 2012-example
Lcna 2012-example
 
Accumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and RoadmapAccumulo 1.4 Features and Roadmap
Accumulo 1.4 Features and Roadmap
 
Stashaway 1
Stashaway 1Stashaway 1
Stashaway 1
 
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bagBrown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
Brown Bag : CDAP (f.k.a Reactor) Streams Deep DiveStream on file brown bag
 
Bacula Overview
Bacula OverviewBacula Overview
Bacula Overview
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Linux Distribution Automated Testing
 Linux Distribution Automated Testing Linux Distribution Automated Testing
Linux Distribution Automated Testing
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django application
 
Toolchain Independent Distributed Compilation
Toolchain Independent Distributed CompilationToolchain Independent Distributed Compilation
Toolchain Independent Distributed Compilation
 
Source andassetcontrolingamedev
Source andassetcontrolingamedevSource andassetcontrolingamedev
Source andassetcontrolingamedev
 
Introduction to containers
Introduction to containersIntroduction to containers
Introduction to containers
 
SNIA SDC 2016 final
SNIA SDC 2016 finalSNIA SDC 2016 final
SNIA SDC 2016 final
 
Remote file path traversal attacks for fun and profit
Remote file path traversal attacks for fun and profitRemote file path traversal attacks for fun and profit
Remote file path traversal attacks for fun and profit
 
CSA-lecture 6.pptx
CSA-lecture 6.pptxCSA-lecture 6.pptx
CSA-lecture 6.pptx
 
Technical track-afterimaging Progress Database
Technical track-afterimaging Progress DatabaseTechnical track-afterimaging Progress Database
Technical track-afterimaging Progress Database
 
Course 102: Lecture 16: Process Management (Part 2)
Course 102: Lecture 16: Process Management (Part 2) Course 102: Lecture 16: Process Management (Part 2)
Course 102: Lecture 16: Process Management (Part 2)
 
Utopia Kingdoms scaling case. From 4 users to 50.000+
Utopia Kingdoms scaling case. From 4 users to 50.000+Utopia Kingdoms scaling case. From 4 users to 50.000+
Utopia Kingdoms scaling case. From 4 users to 50.000+
 
Utopia Kindgoms scaling case: From 4 to 50K users
Utopia Kindgoms scaling case: From 4 to 50K usersUtopia Kindgoms scaling case: From 4 to 50K users
Utopia Kindgoms scaling case: From 4 to 50K users
 
Google File System
Google File SystemGoogle File System
Google File System
 

Dernier

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Dernier (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored]

  • 1. Building Bulk Import Eric Newton SW Complete, Inc.
  • 2. Ingest 101 The life of a mutation: ○ Send to server ○ Write to Write-Ahead Log ○ Store in memory ○ Write memory to a file ○ Merge, re-write as needed At least two writes, small in-memory sort
  • 3. Bulk Ingest ● Accumulo: heavy ingest ● Use Map-Reduce efficiency ● Pre-sort incoming data ● Hand whole sorted files to accumulo ● One write ● Larger sorts
  • 4. Version 1 importDirectory(String dir, String failDir) Client computes the servers that need the file: ■ Analysis of file ■ Moves directory under Accumulo ■ Retry logic (to handle splits, failures)
  • 5. Version 1: problems ● Limited to the client’s computational power ● Permission: client had to be all-knowing ● Files could be added to servers many times ● Defer file collection while bulk importing ● Clients can fail
  • 6. Version 1.1 ● Clients hand bulk imports to the master ● Fixed permission problems ● Added a bulk import test to the Random Walk test suit
  • 7. The Test Create a sorted file with a lot of 1’s: 12345678 -> 1 Create an identical file, with lots of -1’s: 12345678 -> -1 Add a summation iterator over the table Verify: every entry should be zero: 12345678 -> 0
  • 8. Random Walk 101 Randomly: ○ Import files in random order ○ Split the files into random sizes ○ Split tablets and random points ○ Kill tablet servers (agitate) Under load At scale
  • 9. Version 2.0 ● Master only coordinates file processing ● Distribute work to tablet servers ● Master distributes files to tablet servers ● Tablet serves ○ Analyze files for assignments ○ Retry ○ Communicate with destination tablet servers
  • 11. Problems Problems solved: ○ Permissions controlled by Master ○ File processing is distributed Bulk import tested heavily Consistency ○ Not so much
  • 12. Version 3.0 Problem: file imported more than once ○ Repeated reloading ○ RPC timeouts ○ Tablet migration ○ Tablet split Add flags to metadata table to prevent: ○ file garbage collection ○ repeated imports
  • 13. Version 3.0 ● Problem Solved ○ Reduced Name Node ops ○ Reduced trash laying around from failed imports ○ Imports not repeated ● Does it stand up to the Random Walk Test? ○ Not so much
  • 14. Death by Slow Thread “Please take this file” “OK… looks good, never saw this before” Sleep “Hey, Please take this file, again” “OK… looks good, never saw this before” “Thanks!” Compact! Wakeup! Time to import that file from the 1st request!
  • 15. Zookeeper to the Rescue Add a 3rd party negotiator ● Define a session ● Add a file only while session is active ● Store session in zookeeper ● Get agreement about the session at each critical point, including metadata table updates Session guides clean-up of markers Session closes only after all agree
  • 16. Session “Take this file, for session 123” “OK, working on session 123” “Never saw this file before” Sleep “Repeat, file for session 123” “Never saw this file before” “Anybody working on session 123?”
  • 17. Session “Yes, I’m processing session 123!” “OK, finish up.” Double check on session 123, import file “Anybody working on session 123?” “What’s session 123?” Remove markers in metadata
  • 18. Bulk Import ● Problem solved: ○ Distributed processing ○ Permissions ○ Files imported once, and only once ○ Markers are cleaned up in the face of failures ● Performance ○ Not so much
  • 19. Master as Bottleneck Bulk import thousands of files Every 15 minutes Master renames files and puts them under /accumulo Bottleneck One master competes for NN ops with N tablet servers
  • 20. Master, Go Faster Add configurable thread pool Push more move requests to the NN Compete more fiercely for resources
  • 21. Bulk Import More efficient data ingest Easy: just a file to the right tablets Hard: consistency in a distributed system Testing is your friend Nothing prepares you for large-scale problems!