從零開始的爬蟲之旅 Crawler from zero
2. About Me
• a.k.a. Shi-Ken Don
• 2009 - 2014
• Ruby Developer since 2013
• 2014 - 2015
• Web Developer at Backer-Founder since 2015
20. Ruby vs. Python
                        user     system      total        real
Ruby Thread.new    16.090000   1.010000  17.100000 ( 17.254499)
Ruby Parallel      16.900000   1.100000  18.000000 ( 18.080813)
Python threading    0.000000   0.000000  16.360000 ( 16.532583)
Test: 1000 threads requesting www.facebook.com
Scripts: ruby_thread_tests.rb, threading_test.py
Python Ruby
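The scripts behind this benchmark are not included in the slides. As a rough sketch of how the Thread.new row could be measured with Ruby's standard Benchmark module: `run_in_threads`, `fetch_status`, and `benchmark_threads` are hypothetical names, and Net::HTTP is an assumed HTTP client, not necessarily the one the original test used.

```ruby
require "benchmark"
require "net/http"

# Starts `count` threads, runs the given block in each one,
# and returns the block results in thread-creation order.
def run_in_threads(count)
  threads = Array.new(count) { |i| Thread.new { yield i } }
  threads.map(&:value) # value joins the thread and returns its result
end

# Assumed fetcher: returns the HTTP status code of a GET request as a string.
def fetch_status(url)
  Net::HTTP.get_response(URI(url)).code
end

# Prints a user/system/total/real row like the table above.
def benchmark_threads(label, count, &work)
  Benchmark.bm(label.length) do |x|
    x.report(label) { run_in_threads(count, &work) }
  end
end
```

A run comparable to the slide would be `benchmark_threads("Ruby Thread.new", 1000) { fetch_status("https://www.facebook.com") }`. Passing the work as a block keeps the threading helper independent of the network, so it can be checked locally first.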
21. Benchmark (ruby_thread_sidekiq_test.rb)
                                    user     system      total        real
Thread.new (parse and create)   5.640000   0.580000   6.220000 (  6.073333)
Thread.new (parse only)         5.060000   0.440000   5.500000 (  5.455837)
Thread.new (create directly)    3.340000   0.520000   3.860000 (  5.169519)
Test: 100 threads requesting www.facebook.com
parse only
26. Heroku Auto Scaling
heroku.rake
task :auto_scaling, [:WORKER_NAME] => :environment do |_t, args|
  args.with_defaults(WORKER_NAME: "worker")
  app_name = ENV["HEROKU_APP_NAME"]
  worker_name = args[:WORKER_NAME]
  heroku = Heroku::API.new
  # Total number of jobs waiting across all Sidekiq queues
  queues_size = Sidekiq::Queue.all.map(&:size).inject(0, :+)
  # 2X dyno 600 jobs
  # 50 parse project_log
  # jobs 45
  now_minutes = Time.now.min
  # Minutes left until minute 45 of the current hour (0 once past it)
  left_minutes = now_minutes.between?(0, 45) ? 45 - now_minutes : 0
  # Workers needed = queued jobs / 500 per worker / minutes left
  workers_size = queues_size / 500 / [left_minutes, 1].max
  workers_size = 1 if workers_size < 1
  workers_size = 10 if workers_size > 10 # at most 10 workers
  puts "Scaling #{worker_name} dyno count to #{workers_size}"
  heroku.post_ps_scale(app_name, worker_name, workers_size)
end
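The sizing arithmetic in the task can be checked in isolation. The sketch below pulls it into a pure function; `workers_for` and its keyword parameters are hypothetical names, with defaults copied from the numbers used in the task (500 jobs, minute 45, at most 10 workers).

```ruby
# Hypothetical helper mirroring the task's arithmetic:
# queued jobs / jobs per worker / minutes left, clamped to 1..max_workers.
def workers_for(queue_size, now_minute,
                jobs_per_worker: 500, deadline_minute: 45, max_workers: 10)
  left_minutes =
    now_minute.between?(0, deadline_minute) ? deadline_minute - now_minute : 0
  size = queue_size / jobs_per_worker / [left_minutes, 1].max # integer division
  size.clamp(1, max_workers)
end

workers_for(30_000, 30) # 15 minutes left: 60 / 15 = 4 workers
workers_for(30_000, 40) # 5 minutes left: 60 / 5 = 12, capped at 10
workers_for(100, 10)    # small queue: rounds down to 0, raised to 1
```

Because both divisions are integer divisions, the result rounds down at each step, so the function tends to under-provision slightly rather than over-provision.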
30. • File descriptor
✦ macOS: 256 (default)
✦ Linux: 1024 (default)
✦ Windows: who cares
Linux file descriptor limit: 1024
CPU RAM
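Since every open socket uses a file descriptor, it can help to check the limit before choosing a thread count. A minimal sketch using Ruby's built-in `Process.getrlimit`; the helper names and the reserve of 64 descriptors are assumptions for illustration, not values from the talk.

```ruby
# Returns the soft and hard limits on open file descriptors for this process.
def fd_limits
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  { soft: soft, hard: hard }
end

# Hypothetical helper: caps a requested thread count so that each thread can
# hold one socket while `reserve` descriptors stay free for everything else.
def safe_thread_count(requested, reserve: 64)
  available = [fd_limits[:soft] - reserve, 1].max
  [requested, available].min
end

puts fd_limits.inspect
puts safe_thread_count(1000)
```

On a machine with the default Linux limit of 1024, a request for 1000 threads would be reduced to 960 with this reserve.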
37. • User-Agent
  ✦ Mozilla/5.0 (compatible; MSIE 10.0;Windows NT 6.1;Trident/6.0)
• Google Bot
  ✦ Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
SEO
Googlebot
ψ( ∇´)ψ
39. User-Agent
require "httparty"

DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; CrowdTrail/1.0; +https://crowdwatch.tw/)"

# Appends a random version suffix so each request sends a slightly different User-Agent
def random_user_agent_string
  format(
    "%s Random/0.%d.%d",
    DEFAULT_USER_AGENT,
    Random.rand(100),
    Random.rand(100)
  )
end

HTTParty.get("https://www.facebook.com", headers: { "User-Agent" => random_user_agent_string })
46. Use Integer() / Float() instead of to_i
# Bad
doc.css('.tab .pledged').text.to_i
# Good
Integer(doc.css('.tab .pledged').text)
to_i silently turns nil (or an empty string) into 0; Integer() raises an error instead.
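The difference is easy to see without Nokogiri. In the sketch below, `parse_amount` is a hypothetical helper, and stripping thousands separators is an assumption about how the scraped text might look.

```ruby
# to_i never fails: anything it cannot parse becomes 0.
nil.to_i    # => 0
"".to_i     # => 0
"N/A".to_i  # => 0

# Hypothetical helper: strictly parses scraped text such as "1,234".
# Raises ArgumentError when the text is missing or not a base-10 integer.
def parse_amount(text)
  Integer(text.to_s.delete(",").strip, 10)
end

parse_amount("1,234") # => 1234

begin
  parse_amount("") # e.g. the CSS selector matched nothing
rescue ArgumentError => e
  warn "parse failed: #{e.message}"
end
```

Raising on bad input means a changed page layout shows up as an error in the crawler instead of as a row of zeros in the database.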