11. Pipes and Pipelines
• Pipes contain jobs
• A pipeline is a group of pipes
• Easy to create pipelines and add pipes
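A minimal sketch of the containment described above. The `Pipe` and `Pipeline` classes and their methods are hypothetical names chosen for illustration; the slides do not show the framework's real API.

```python
class Pipe:
    """A named container of jobs (hypothetical, not the framework's class)."""

    def __init__(self, name):
        self.name = name
        self.jobs = []

    def add_job(self, job):
        self.jobs.append(job)


class Pipeline:
    """A named group of pipes (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.pipes = {}

    def add_pipe(self, pipe):
        # Register the pipe under its name and hand it back for chaining.
        self.pipes[pipe.name] = pipe
        return pipe

    def job_count(self):
        # Total number of jobs across every pipe in this pipeline.
        return sum(len(pipe.jobs) for pipe in self.pipes.values())


crawl = Pipeline("crawl")
fetch = crawl.add_pipe(Pipe("fetch"))
extract = crawl.add_pipe(Pipe("extract"))
fetch.add_job({"url": "http://example.com/"})
```

Creating a pipeline and adding pipes is then two or three lines each, which is the kind of ease the slide claims.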
5-Apr-09 CloudCamp - Bangalore
12. Job ORM
class CrawlerJob (JobBase):
    class SDBInterfaceConfig:    # SDB API
        domain_name = settings.CRAWLER_JOB_DOMAIN
    class SQSInterfaceConfig:    # SQS API
        queue_name = settings.CRAWLER_JOB_QUEUE
        timeout = settings.CRAWLER_JOB_TIMEOUT
    class AWSMetaData:
        action = CharField (...)
        url = CharField (...)
        ...
    ...
Default attributes of each Job:
• Pipeline Name
• Status
• Start Time
• End Time
• Id
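The default attributes listed above could be modelled roughly as follows. This is a sketch only: `JobDefaults`, its status values, and the `start`/`finish` helpers are assumptions, not the actual `JobBase` implementation.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class JobDefaults:
    """Hypothetical stand-in for the attributes every job carries."""
    pipeline_name: str
    status: str = "new"  # the status values used here are assumed
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def start(self):
        self.status = "running"
        self.start_time = datetime.now(timezone.utc)

    def finish(self, status="done"):
        self.status = status
        self.end_time = datetime.now(timezone.utc)


job = JobDefaults("crawl")
job.start()
job.finish()
```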
13. Job Processing
for i in range (num_of_jobs):
    try:
        job = cls.jobclass.sqs_get()
        # process job
        ...
    except Exception, e:
        job.job_processing_complete(…)
        fsdebug.mail_admins (..)
        end_transaction(rollback = True)
        job.sdb_save() # save in error store
    finally:
        job.sqs_del() # delete the job
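The same control flow can be tried out locally with in-memory stand-ins. In this sketch a `deque` plays the role of the SQS queue and a list plays the role of the SDB error store; `process_jobs` and `handler` are hypothetical names. One deliberate difference from the slide: the job is fetched before the `try` block, so the `except` and `finally` clauses never refer to an unbound `job`.

```python
from collections import deque


def process_jobs(queue, handler, error_store, num_of_jobs):
    """Process up to num_of_jobs jobs; failed jobs are kept in error_store."""
    processed = 0
    for _ in range(num_of_jobs):
        if not queue:
            break  # nothing left to fetch
        job = queue[0]  # peek: the job stays queued while it is processed
        try:
            handler(job)
            processed += 1
        except Exception as exc:
            # Stand-in for job.sdb_save(): keep the failed job for inspection.
            error_store.append((job, str(exc)))
        finally:
            queue.popleft()  # stand-in for job.sqs_del()
    return processed


def handler(job):
    # Toy handler that rejects URLs it cannot crawl.
    if job["url"].startswith("bad"):
        raise ValueError("cannot crawl " + job["url"])


queue = deque([{"url": "http://example.com/"}, {"url": "bad://x"}])
errors = []
done = process_jobs(queue, handler, errors, num_of_jobs=5)
```

Deleting the message in `finally` means a failed job is never retried from the queue; the copy in the error store is the only record of it.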
14. The Good
• Architecture easy to extend
• ORM approach is a big time saver
• Simple to add new services
15. The Bad
• Messages may be lost
– Service Failure
– SQS deletes messages after 4 days.
Important: the system should be able to recreate jobs
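One way to make jobs recreatable is to compare what should have been processed against what is completed or still queued, and re-enqueue the difference. The sketch below assumes the system keeps a list of expected URLs somewhere durable; `jobs_to_recreate` and the sample data are hypothetical.

```python
def jobs_to_recreate(expected_urls, completed_urls, queued_urls):
    """Return the URLs that are neither completed nor still in the queue."""
    accounted_for = set(completed_urls) | set(queued_urls)
    return sorted(set(expected_urls) - accounted_for)


expected = ["http://a.example/", "http://b.example/", "http://c.example/"]
completed = ["http://a.example/"]
queued = []  # e.g. the messages for b and c expired or were lost
lost = jobs_to_recreate(expected, completed, queued)
```

Each URL in `lost` would then be turned back into a job and sent to the queue again.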
17. What do we store?
• Crawler Data – Web Pages
• Extracted Content – Questions/Answers
• Backups
18. Storage Structure
• Meta Data: Postgres
• Key + Value: S3
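The split can be illustrated with local stand-ins: an in-memory SQLite table in place of Postgres for the metadata, and a dictionary in place of an S3 bucket for the key/value data. The table layout, the SHA-1 key scheme, and the function names are all assumptions made for this sketch.

```python
import hashlib
import sqlite3

s3_bucket = {}  # stand-in for an S3 bucket: key -> bytes
db = sqlite3.connect(":memory:")  # stand-in for Postgres
db.execute("CREATE TABLE webpage (url TEXT PRIMARY KEY, s3_key TEXT, size INTEGER)")


def store_page(url, content):
    """Put the page body in the object store and its metadata in the database."""
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    s3_bucket[key] = content
    db.execute("INSERT OR REPLACE INTO webpage VALUES (?, ?, ?)",
               (url, key, len(content)))
    return key


def load_page(url):
    """Look up the key in the metadata table, then fetch the body by key."""
    row = db.execute("SELECT s3_key FROM webpage WHERE url = ?", (url,)).fetchone()
    return s3_bucket[row[0]] if row else None


store_page("http://example.com/", b"<html>hello</html>")
```

Queries over metadata stay in SQL, while the large page bodies live only in the object store.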
19. ORM
• Extended Django ORM to support S3
class S3WebPage (S3Model):
    _allowed_attrs = ["url", "content", ..]
    _name = "S3WebPage"
    ...
20. The Good
• Extremely scalable
• Possible to store Python objects in S3
• Latency issues can be solved by using a caching layer
• No need to back up S3 data
• Storage is cheap
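The caching layer mentioned above could be as simple as a read-through cache in front of the store. This sketch uses a dictionary as the slow store; `ReadThroughCache` is a hypothetical name, and a production setup would more likely use a shared cache with eviction.

```python
class ReadThroughCache:
    """Serve repeated reads from memory; fetch from the store only on a miss."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self.misses = 0

    def get(self, key):
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


slow_store = {"page-1": "<html></html>"}  # stand-in for S3
cache = ReadThroughCache(slow_store.__getitem__)
first = cache.get("page-1")
second = cache.get("page-1")  # served from memory, no second fetch
```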
21. The Bad
• Postgres + S3 is not an elegant solution
– Periodic syncing of Postgres and S3 required
• High transaction costs
– $0.01 per 1,000 PUT, COPY, POST, or LIST requests
– $0.01 per 10,000 GET requests
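To see how these per-request prices add up, here is a small calculation. The prices come from the slide; the request volumes are hypothetical and only serve to show the arithmetic.

```python
PRICE_PER_1000_WRITES = 0.01   # USD per 1,000 PUT/COPY/POST/LIST requests (from the slide)
PRICE_PER_10000_GETS = 0.01    # USD per 10,000 GET requests (from the slide)


def request_cost(writes, gets):
    """Return the request charges in USD for the given request counts."""
    return (writes / 1000 * PRICE_PER_1000_WRITES
            + gets / 10000 * PRICE_PER_10000_GETS)


# Hypothetical monthly volume: 1 million writes and 5 million reads.
cost = request_cost(writes=1_000_000, gets=5_000_000)
```

With these assumed volumes the writes cost about $10 and the reads about $5, so a write is ten times as expensive as a read per request.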
23. EC2 – The Good
• Computing needs are not constant
• Data transfer to other AWS services is free
• AMIs per node type
24. The Bad
• We missed having a nerve center to monitor:
– Budget
– Job Load
– CPU load
• Low-cost 64-bit servers are not available