Spider Evolution (Spider进化论)
- 2. Topic
• Simplest Spider
• Framework(Scrapy)
– Abstraction
– IO Model
• Evolution
– Architecture
– Module
• Simplify
• Do it
- 8. Framework(Scrapy)---IO Model
• Concepts
– Synchronous/Asynchronous(IO state consistency)
– Block/Nonblock(Process/Thread status)
• IO Model
– Synchronous Block(urllib)
– Asynchronous Block(spynner, gevent, nginx_lua)
– Asynchronous NonBlock(twisted, reactor, proactor)
– Synchronous NonBlock(mystery)
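The contrast between a blocking read and a reactor-style nonblocking read can be sketched with the standard library alone. This is a minimal illustration, not how Scrapy or twisted is implemented: the `selectors` module stands in for a reactor, local socket pairs stand in for network connections, and the names `blocking_read`, `reactor_read` and `demo` are hypothetical.

```python
import selectors
import socket


def blocking_read(sock):
    # Synchronous + blocking: the caller is parked inside recv()
    # until the peer has sent something.
    return sock.recv(1024)


def reactor_read(socks):
    """Reactor-style loop: register nonblocking sockets, then react
    to whichever one becomes readable first."""
    sel = selectors.DefaultSelector()
    results = {}
    for name, sock in socks.items():
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ, data=name)
    while len(results) < len(socks):
        # select() reports only sockets that are ready, so recv()
        # below returns immediately instead of waiting.
        for key, _ in sel.select(timeout=1):
            results[key.data] = key.fileobj.recv(1024)
            sel.unregister(key.fileobj)
    sel.close()
    return results


def demo():
    # Assumption: socket pairs simulate two in-flight downloads.
    a_r, a_w = socket.socketpair()
    b_r, b_w = socket.socketpair()
    try:
        b_w.sendall(b"page-b")
        a_w.sendall(b"page-a")
        return reactor_read({"a": a_r, "b": b_r})
    finally:
        for s in (a_r, a_w, b_r, b_w):
            s.close()
```

With the blocking version, one slow connection stalls the whole spider; with the reactor loop, a single thread services many connections and only touches those that have data.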
- 11. Evolution---Module
• Downloader
– Render
• Webkit(Javascript)
• Webkit(AJAX):click simulation, event notify
• Webkit(CSS): css feature
– ADSL Proxy
• How to get
– Why scan by ourselves
• How to use
– Why nginx
- 12. Evolution---Module
• Extractor
– Wrapper induction
• Semi automation
– Firefox extensions
– How to improve
– Templates management
• Full automation
– Scrapy extract tool
• Cascade extraction supported
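Template-driven, cascaded extraction can be sketched as follows. This is an illustrative assumption about what a managed template might look like, not the extract tool from the slides: the `TEMPLATE` layout, the regular expressions and the `extract` function are all hypothetical. An outer pattern selects record blocks, then inner patterns pull fields out of each block.

```python
import re

# Hypothetical template: "block" isolates each record, "fields"
# are applied inside a block (the cascade step).
TEMPLATE = {
    "block": r'<li class="item">(.*?)</li>',
    "fields": {
        "title": r"<h2>(.*?)</h2>",
        "price": r'<span class="price">(.*?)</span>',
    },
}


def extract(html, template):
    """Apply the outer pattern, then each field pattern per block."""
    records = []
    for block in re.findall(template["block"], html, flags=re.S):
        record = {}
        for name, pattern in template["fields"].items():
            match = re.search(pattern, block, flags=re.S)
            # A missing field becomes None rather than an error.
            record[name] = match.group(1).strip() if match else None
        records.append(record)
    return records
```

Keeping templates as plain data is one way to make them manageable per site; a real extractor would more likely use XPath or CSS selectors than regular expressions.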
- 13. Evolution---Module
• Scheduler
– FIFO Queue
– Priority Queue
• Seed weight
• Smallest interval
• User Query distribution
• User Query importance
• Webpage change characteristics
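A priority-queue scheduler that combines a seed weight with a smallest per-host interval might look like the sketch below. The class name `Scheduler`, its parameters, and the choice to pass the clock in as `now` are assumptions for illustration; query distribution and page-change signals from the slide would feed into the weight and are not modelled here.

```python
import heapq
import itertools
from urllib.parse import urlparse


class Scheduler:
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval   # smallest gap between hits on one host
        self._heap = []                    # entries: (-weight, sequence, url)
        self._counter = itertools.count()  # keeps FIFO order among equal weights
        self._last_fetch = {}              # host -> time of last scheduled fetch

    def push(self, url, weight=1.0):
        # heapq is a min-heap, so the weight is negated to pop the largest first.
        heapq.heappush(self._heap, (-weight, next(self._counter), url))

    def pop(self, now):
        """Return the highest-weight URL whose host is ready at `now`, else None."""
        deferred = []
        chosen = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlparse(entry[2]).netloc
            last = self._last_fetch.get(host)
            if last is None or now - last >= self.min_interval:
                self._last_fetch[host] = now
                chosen = entry[2]
                break
            deferred.append(entry)  # host is still cooling down
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return chosen
```

Replacing the weight with a constant reduces this to the FIFO queue from the first bullet, since the sequence counter then decides the order.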
- 15. Simplify
• IO module
– Synchronous block
• No Middleware supported
• No Item Loader
• No Framework
• No …
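A spider simplified along these lines (synchronous blocking IO, no middleware, no item loader, no framework) fits in a few dozen lines of standard-library Python. This is a rough sketch under assumptions: `LinkParser`, `fetch` and `crawl` are hypothetical names, the fetcher is injectable so the loop can be exercised without a network, and politeness, robots.txt handling and extraction are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    # Synchronous blocking download: nothing else runs until it returns.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, fetcher=fetch, max_pages=10):
    queue = deque([seed])  # FIFO queue gives breadth-first order
    seen = {seed}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetcher(url)
        except OSError:  # urllib's URLError is a subclass of OSError
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("http://example.com/")` would fetch over the network; passing a custom `fetcher` keeps the loop testable offline.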