To Batch Or Not To Batch

Luca Mearelli
rubyday.it 2011

First and foremost, we believe that speed
is more than a feature. Speed is the most
important feature. If your application is
slow, people won’t use it.
Fred Wilson

@lmea #rubyday

Not all the interesting features are fast

Interacting with remote APIs
Sending emails
Media transcoding
Large dataset handling

Anatomy of an asynchronous action

The app decides it needs to do a long operation
The app asks the async system to do the operation and quickly returns the response
The async system executes the operation out-of-band
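
The three steps can be sketched in plain Ruby, with a background thread standing in for the async system. This is only an in-process illustration using the standard library; AsyncSystem and handle_request are names made up for the example, and a real application would use one of the tools covered in the rest of the talk.

```ruby
require 'thread'

# In-process sketch: a background thread stands in for the async system.
# AsyncSystem and handle_request are hypothetical names for illustration.
class AsyncSystem
  attr_reader :results

  def initialize
    @jobs    = Queue.new   # thread-safe FIFO
    @results = Queue.new
    @worker  = Thread.new do
      # pop blocks until a job arrives; :shutdown ends the loop
      while (job = @jobs.pop) != :shutdown
        @results << job.call
      end
    end
  end

  def enqueue(&block)
    @jobs << block
  end

  def shutdown
    @jobs << :shutdown
    @worker.join
  end
end

# The app hands the long operation over and returns immediately.
def handle_request(async, n)
  async.enqueue { sleep 0.05; n * 2 }   # stand-in for the slow work
  "accepted"
end

async    = AsyncSystem.new
response = handle_request(async, 21)   # "accepted", before the job has run
result   = async.results.pop           # blocks until the worker has finished
async.shutdown
```

The response comes back before the operation has run; the result only becomes available later, which is why the status-tracking slides further on matter.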

Batch
Asynchronous jobs
Queues & workers

Batch

Cron

scheduled operations
unrelated to the requests
low frequency
longer run time

Anatomy of a cron batch: the rake task

namespace :export do
  task :items_xml => :environment do
    # read the env variables
    # make the export
  end
end
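
What the two placeholder comments might expand to is sketched here in plain Ruby, outside Rake so it runs on its own. Everything in it is an assumption: export_items_xml is a hypothetical helper, the items are plain hashes standing in for the application's models, and values are written without XML escaping to keep the example short.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical export body. Items are plain hashes here; no XML escaping
# is done, so this is an illustration rather than a production exporter.
def export_items_xml(items, folder)
  FileUtils.mkdir_p(folder)
  path = File.join(folder, 'items.xml')
  File.open(path, 'w') do |f|
    f.puts '<items>'
    items.each { |item| f.puts %(  <item id="#{item[:id]}">#{item[:name]}</item>) }
    f.puts '</items>'
  end
  path
end

# Read the env variable, falling back to a temporary folder when unset.
folder = ENV.fetch('XML_FOLDER') { Dir.mktmpdir }
export_items_xml([{ :id => 1, :name => 'first' }], folder)
```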

Anatomy of a cron batch: the shell script

#!/bin/sh
# this goes in script/item_export_full.sh
cd /usr/rails/MyApp/current
export RAILS_ENV=production

echo "Item Export Full started: `date`"
rake export:items_xml XML_FOLDER='data/exports'
echo "Item Export Full completed: `date`"

Anatomy of a cron batch: the crontab entry

0 0 1 * * /usr/rails/MyApp/current/script/item_export_full.sh >> /usr/rails/MyApp/current/log/dump_item_export.log 2>&1

30 13 * * * cd /usr/rails/MyApp/current; ruby /usr/rails/MyApp/current/script/runner -e production "Newsletter.deliver_daily" >> /usr/rails/MyApp/current/log/newsletter_daily.log 2>&1
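
Each entry starts with five schedule fields (minute, hour, day of month, month, day of week) followed by the command, so the first entry runs at midnight on the 1st of every month and the second every day at 13:30. A small Ruby sketch that labels the fields; cron_fields is a hypothetical helper for illustration, not part of any library:

```ruby
CRON_FIELDS = %w[minute hour day_of_month month day_of_week].freeze

# Split a crontab entry into its five named schedule fields and the command.
def cron_fields(entry)
  *schedule, command = entry.split(' ', 6)
  [CRON_FIELDS.zip(schedule).to_h, command]
end

fields, command = cron_fields('30 13 * * * ruby script/runner "Newsletter.deliver_daily"')
fields['hour']     # "13"
fields['minute']   # "30"
```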

Cron helpers

Whenever
https://github.com/javan/whenever

Craken
https://github.com/latimes/craken

Whenever: schedule.rb

# adds ">> /path/to/file.log 2>&1" to all commands
set :output, '/path/to/file.log'

every 3.hours do
  rake "my:rake:task"
end

every 1.day, :at => '4:30 am' do
  runner "MyModel.task_to_run_at_four_thirty_in_the_morning"
end

every :hour do # Many shortcuts available: :hour, :day, :month, :year, :reboot
  command "/usr/bin/my_great_command",
          :output => {:error => 'error.log', :standard => 'cron.log'}
end

Craken: raketab

59 * * * * thing:to_do > /tmp/thing_to_do.log 2>&1

@daily solr:reindex > /tmp/solr_daily.log 2>&1

# also @yearly, @annually, @monthly, @weekly, @midnight, @hourly

Craken: raketab.rb

Raketab.new do |cron|
  cron.schedule 'thing:to_do > /tmp/thing_to_do.log 2>&1',
                :every => mon..fri

  cron.schedule 'first:five:days > /tmp/thing_to_do.log 2>&1',
                :days => [1,2,3,4,5]

  cron.schedule 'first:day:q1 > /tmp/thing_to_do.log 2>&1',
                :the => '1st', :in => [jan,feb,mar]

  cron.schedule 'first:day:q4 > /tmp/thing_to_do.log 2>&1',
                :the => '1st', :months => 'October,November,December'
end

Queues & Workers

unscheduled operations
responding to a request
mid to high frequency
mixed run time

Queues & Workers

Delayed job
https://github.com/collectiveidea/delayed_job

Resque
https://github.com/defunkt/resque

Delayed job

Any object method can be a job
DB-backed queue
Integer-based priority
Lifecycle hooks (enqueue, before, after, ... )
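
How an integer priority and a run time combine can be sketched without a database. In delayed_job lower priority numbers run first; the rest of this example is an assumption for illustration: QueuedJob and next_job are hypothetical names, and the real library selects jobs with a query on its jobs table.

```ruby
# QueuedJob and next_job are hypothetical names; in delayed_job the jobs
# live in a database table and the selection is done by a query.
QueuedJob = Struct.new(:name, :priority, :run_at)

# Pick the next runnable job: only jobs whose run_at has passed,
# lowest priority number first, oldest run_at as tie-breaker.
def next_job(jobs, now = Time.now)
  jobs.select { |job| job.run_at <= now }
      .min_by { |job| [job.priority, job.run_at] }
end

now  = Time.now
jobs = [
  QueuedJob.new('newsletter', 20, now - 60),
  QueuedJob.new('notify',      0, now - 10),
  QueuedJob.new('report',      0, now + 3600)   # scheduled for later
]
next_job(jobs, now).name   # => "notify"
```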

Delayed job: simple jobs

# without delayed_job
@user.notify!(@event)

# with delayed_job
@user.delay.notify!(@event)

# always asynchronous method
class Newsletter
  def deliver
    # long running method
  end
  handle_asynchronously :deliver
end

newsletter = Newsletter.new
newsletter.deliver
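
The idea behind delay can be illustrated with a small proxy that records the call instead of running it. This is a sketch of the pattern only, not delayed_job's implementation: DelayProxy, PENDING and work_off are hypothetical names, and the real library serializes the call into its database table for a separate worker process.

```ruby
# Calls recorded by the proxy, waiting to be executed later.
PENDING = []

# A proxy that stores any method call made on it instead of running it.
class DelayProxy < BasicObject
  def initialize(target)
    @target = target
  end

  def method_missing(name, *args)
    ::PENDING << [@target, name, args]
    nil
  end
end

class User
  attr_reader :notified

  def initialize
    @notified = []
  end

  def notify!(event)
    @notified << event
  end

  def delay
    DelayProxy.new(self)
  end
end

# What a worker would do later: replay the recorded calls.
def work_off
  until PENDING.empty?
    target, name, args = PENDING.shift
    target.public_send(name, *args)
  end
end

user = User.new
user.delay.notify!(:signup)   # returns at once, nothing has run yet
work_off                      # now notify! is actually executed
```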

Delayed job: handle_asynchronously

handle_asynchronously :sync_method,
:priority => 20

handle_asynchronously :in_the_future,
:run_at => Proc.new { 5.minutes.from_now }

handle_asynchronously :call_a_class_method,
:run_at => Proc.new { when_to_run }

handle_asynchronously :call_an_instance_method,
:priority => Proc.new {|i| i.how_important }
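
Option values given as a Proc are evaluated when the job is enqueued rather than when the class is loaded. One way this could work is sketched below, under the assumption that one-argument Procs receive the object and the others are called without arguments; resolve_options and Message are hypothetical names and this is not delayed_job's source.

```ruby
# Hypothetical helper: resolve option values that may be Procs.
def resolve_options(object, options)
  options.each_with_object({}) do |(key, value), resolved|
    resolved[key] =
      if value.is_a?(Proc)
        # one-argument Procs receive the object, others are called bare
        value.arity == 1 ? value.call(object) : value.call
      else
        value
      end
  end
end

Message = Struct.new(:how_important)

# :priority is computed from the instance and resolves to 7
resolve_options(Message.new(7), :priority => Proc.new { |m| m.how_important })
# plain values pass through unchanged
resolve_options(Message.new(7), :priority => 20)
```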

Delayed job

class NewsletterJob < Struct.new(:text, :emails)
  def perform
    emails.each { |e| NewsMailer.deliver_text_to_email(text, e) }
  end
end

Delayed::Job.enqueue NewsletterJob.new('lorem ipsum...',
                                       User.find(:all).collect(&:email))

Delayed job

RAILS_ENV=production script/delayed_job -n 2 --min-priority 10 start

RAILS_ENV=production script/delayed_job stop

rake jobs:work

Delayed job: checking the job status

The queue is for scheduled and running jobs
Handle the status outside the Delayed::Job object

# Include this in your initializers somewhere
# (the name may clash with Ruby's own Queue class; rename it if it does)
class Queue < Delayed::Job
  def self.status(id)
    job = find_by_id(id)
    # a job that completed successfully has been deleted from the table
    return "success" if job.nil?
    job.last_error.nil? ? "queued" : "failure"
  end
end

# Use this method in your poll method like so:
def poll
  status = Queue.status(params[:id])
  if status == "success"
    # Success, notify the user!
  elsif status == "failure"
    # Failure, notify the user!
  end
end

class AJob < Struct.new(:options)

  def perform
    do_something(options)
  end

  def success(job)
    # record success of job.id
    Rails.cache.write("status:#{job.id}", "success")
  end

end

# a helper
def job_completed_with_success(job_id)
  Rails.cache.read("status:#{job_id}") == "success"
end

Resque

Redis-backed queues
Queue/dequeue speed independent of list size
Forking behaviour
Built in front-end
Multiple queues / no priorities

Resque: the job

class Export
  @queue = :export_jobs

  def self.perform(dataset_id, kind = 'full')
    ds = Dataset.find(dataset_id)
    ds.create_export(kind)
  end
end

Resque: enqueuing the job

class Dataset
  def async_create_export(kind)
    Resque.enqueue(Export, self.id, kind)
  end
end

ds = Dataset.find(100)
ds.async_create_export('full')

Resque: persisting the job

# jobs are persisted as JSON,
# so jobs should only take arguments that can be expressed as JSON
{
  "class": "Export",
  "args": [ 100, "full" ]
}

# don't do this: Resque.enqueue(Export, self, kind)
# do this:
Resque.enqueue(Export, self.id, kind)
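
A quick round trip through Ruby's standard json library shows what survives persistence. The round_trip helper is hypothetical and only mimics the payload shape shown above; it does not call Resque.

```ruby
require 'json'

# Hypothetical helper: encode arguments the way the payload above is
# shaped, then decode them again as a worker would.
def round_trip(*args)
  payload = JSON.generate('class' => 'Export', 'args' => args)
  JSON.parse(payload)['args']
end

round_trip(100, 'full')        # integers and strings come back unchanged
round_trip(100, :full)         # the symbol comes back as the string "full"
round_trip(:kind => :full)     # hash keys and values come back as strings
```

Symbols turn into strings and arbitrary objects lose their identity, which is why the job receives the record id and looks the record up again inside perform.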

Resque: generic async methods

# A simple async helper
class Repository < ActiveRecord::Base
  # This will be called by a worker when a job needs to be processed
  def self.perform(id, method, *args)
    find(id).send(method, *args)
  end

  # We can pass this any Repository instance method that we want to
  # run later.
  def async(method, *args)
    Resque.enqueue(Repository, id, method, *args)
  end
end

# Now we can call any method and have it execute later:

@repo.async(:update_disk_usage)
@repo.async(:update_network_source_id, 34)

Resque: anatomy of a worker

# a worker does this:
start
loop do
  if job = reserve
    job.process
  else
    sleep 5
  end
end
shutdown
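
The loop above can be made runnable with in-memory arrays standing in for the Redis lists. SketchWorker is a hypothetical class written for this illustration, and it stops when idle instead of sleeping and polling again. It also shows how a list of queue names acts as a priority order: reserve checks the queues in the order given and takes from the first one that has a job.

```ruby
# Hypothetical in-memory worker; real Resque workers read from Redis.
class SketchWorker
  attr_reader :done

  def initialize(queues, order)
    @queues = queues   # queue name => array of callable jobs
    @order  = order    # queue names, most important first
    @done   = []
  end

  # Check the queues in the order given; the first non-empty one wins.
  def reserve
    @order.each do |name|
      job = (@queues[name] || []).shift
      return job if job
    end
    nil
  end

  # A real worker would sleep and poll again; the sketch stops when idle.
  def work
    while (job = reserve)
      @done << job.call
    end
  end
end

queues = {
  :critical => [lambda { :c1 }],
  :high     => [lambda { :h1 }, lambda { :h2 }],
  :low      => [lambda { :l1 }]
}
worker = SketchWorker.new(queues, [:critical, :high, :low])
worker.work
worker.done   # => [:c1, :h1, :h2, :l1]
```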

Resque: working the queues

$ QUEUES=critical,high,low rake resque:work
$ QUEUES=* rake resque:work
$ PIDFILE=./resque.pid QUEUE=export_jobs rake environment resque:work

task "resque:setup" => :environment do
  AppConfig.a_parameter = ...
end

Resque: monit recipe

# example monit monitoring recipe
check process resque_worker_batch_01
  with pidfile /app/current/tmp/pids/worker_01.pid

  start program = "/bin/bash -c 'cd /app/current; RAILS_ENV=production QUEUE=batch_queue nohup rake environment resque:work & > log/worker_01.log && echo $! > tmp/pids/worker_01.pid'" as uid deploy and gid deploy

  stop program = "/bin/bash -c 'cd /app/current && kill -s QUIT `cat tmp/pids/worker_01.pid` && rm -f tmp/pids/worker_01.pid; exit 0;'"

  if totalmem is greater than 1000 MB for 10 cycles then restart # eating up memory?

  group resque_workers

Resque: built-in monitoring

Resque plugins

Resque-status
https://github.com/quirkey/resque-status

Resque-scheduler
https://github.com/bvandenbos/resque-scheduler/

More at: https://github.com/defunkt/resque/wiki/plugins

Resque-status

Simple trackable jobs for resque
Job instances have a UUID
Jobs can report their status while running
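
The pattern behind these three points can be sketched with the standard library alone: each job gets a UUID at creation time and writes its progress to a shared store under that key. TrackedJob and STATUSES are hypothetical names for illustration; resque-status keeps this information in Redis and has its own API.

```ruby
require 'securerandom'

# Hypothetical in-memory status store, keyed by job UUID.
STATUSES = {}

class TrackedJob
  attr_reader :uuid

  def initialize
    @uuid = SecureRandom.uuid
    STATUSES[@uuid] = { 'status' => 'queued' }
  end

  # Report progress while running.
  def at(num, total, message)
    STATUSES[@uuid] = { 'status' => 'working', 'num' => num,
                        'total' => total, 'message' => message }
  end

  def completed
    STATUSES[@uuid] = STATUSES[@uuid].merge('status' => 'completed')
  end

  def perform(items)
    items.each_with_index do |_item, index|
      at(index + 1, items.size, "At #{index + 1} of #{items.size}")
    end
    completed
  end
end

job = TrackedJob.new
STATUSES[job.uuid]['status']   # "queued" until a worker picks the job up
job.perform(%w[a b c])
STATUSES[job.uuid]['status']   # "completed"
```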

Resque-status

# inheriting from JobWithStatus
class ExportJob < Resque::JobWithStatus

  # perform is an instance method
  def perform
    # fall back to 1000 when no limit is given (nil.to_i would be 0)
    limit = (options['limit'] || 1000).to_i
    items = Item.limit(limit)
    total = items.count
    exported = []
    items.each_with_index do |item, num|
      at(num, total, "At #{num} of #{total}")
      exported << item.to_csv
    end

    File.open(local_filename, 'w') { |f| f.write(exported.join("\n")) }
    complete(:filename=>local_filename)
  end

end

Resque-status

job_id = SleepJob.create(:length => 100)
status = Resque::Status.get(job_id)

# the status object tells us:
status.pct_complete #=> 0
status.status #=> 'queued'
status.queued? #=> true
status.working? #=> false
status.time #=> Time object
status.message #=> "Created at ..."

Resque::Status.kill(job_id)

Resque-scheduler

Queueing for future execution
Scheduling jobs (like cron!)

Resque-scheduler

# run a job in 5 days
Resque.enqueue_in(5.days, SendFollowupEmail)

# run SomeJob at a specific time
Resque.enqueue_at(5.days.from_now, SomeJob)
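
The two calls are closely related: "in N seconds" is the same as "at now + N". A minimal in-memory sketch of a delayed queue built on that idea follows; DelayedQueueSketch is a hypothetical class, durations are plain seconds because 5.days needs ActiveSupport, and resque-scheduler itself stores delayed jobs in Redis.

```ruby
# Hypothetical in-memory stand-in for a delayed queue.
class DelayedQueueSketch
  def initialize
    @delayed = Hash.new { |hash, ts| hash[ts] = [] }   # timestamp => jobs
  end

  # Store the job under the Unix timestamp at which it becomes due.
  def enqueue_at(time, job_name, *args)
    @delayed[time.to_i] << [job_name, args]
  end

  # "In N seconds" is just "at now + N".
  def enqueue_in(seconds, job_name, *args)
    enqueue_at(Time.now + seconds, job_name, *args)
  end

  # Remove and return every job whose timestamp has passed, oldest first.
  def due(now = Time.now)
    ready = @delayed.keys.select { |ts| ts <= now.to_i }.sort
    ready.flat_map { |ts| @delayed.delete(ts) }
  end
end

queue = DelayedQueueSketch.new
queue.enqueue_in(5 * 24 * 60 * 60, 'SendFollowupEmail')   # 5 days, in seconds
queue.due                                                 # nothing is due yet
```

A scheduler process would call something like due at regular intervals and push the returned jobs onto the normal queues for the workers.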

Resque-scheduler

namespace :resque do
  task :setup do
    require 'resque'
    require 'resque_scheduler'
    require 'resque/scheduler'

    Resque.redis = 'localhost:6379'

    # The schedule doesn't need to be stored in a YAML, it just needs to
    # be a hash. YAML is usually the easiest.
    Resque::Scheduler.schedule = YAML.load_file('your_resque_schedule.yml')

    # When dynamic is set to true, the scheduler process looks for
    # schedule changes and applies them on the fly.
    # Also if dynamic the Resque::Scheduler.set_schedule (and remove_schedule)
    # methods can be used to alter the schedule
    #Resque::Scheduler.dynamic = true
  end
end

$ rake resque:scheduler

Resque-scheduler: the yaml configuration

queue_documents_for_indexing:
  cron: "0 0 * * *"
  class: QueueDocuments
  queue: high
  args:
  description: "This job queues all content for indexing in solr"

export_items:
  cron: "30 6 * * 1"
  class: Export
  queue: low
  args: full
  description: "This job does a weekly export"

Other (commercial)

SimpleWorker
http://simpleworker.com

SQS
https://github.com/appoxy/aws/

http://rubygems.org/gems/right_aws

http://sdruby.org/video/024_amazon_sqs.m4v

Other (historical)

Beanstalkd and Stalker
http://asciicasts.com/episodes/243-beanstalkd-and-stalker

http://kr.github.com/beanstalkd/

https://github.com/han/stalker

Backgroundjob (Bj)
https://github.com/ahoward/bj

BackgroundRb
http://backgroundrb.rubyforge.org/

Other (different approaches)

Nanite
http://www.slideshare.net/jendavis100/background-processing-with-nanite

Cloud Crowd
https://github.com/documentcloud/cloud-crowd/wiki/Getting-Started

Ciao! me@spazidigitali.com
