21. Challenges of Web Crawling
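1. Network problems (IP blocked, proxy not enabled, timeout)
2. The target server restricts the User-Agent
3. Deep web problems (you completely forgot that a login is required to see the content)
4. Is the HTML parser written incorrectly?
5. Repeated content cannot be found in the returned format
6. Which kind of database is suitable?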
- Non-relational and schema-less data model
- Low latency and high performance
- Highly scalable
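The first two challenges (timeouts and User-Agent restrictions) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the speaker's own crawler. The names `USER_AGENT`, `TIMEOUT_SECONDS`, `build_request`, `open_url`, and `fetch_with_retry`, along with the retry count and backoff values, are assumptions made for this example.

```python
import time
import urllib.request

# Hypothetical values chosen for illustration only.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleCrawler/0.1)"
TIMEOUT_SECONDS = 10


def build_request(url, user_agent=USER_AGENT):
    """Attach an explicit User-Agent, since some servers reject the default one."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def open_url(url):
    """Fetch a page body with a timeout so a stalled connection cannot hang the crawler."""
    with urllib.request.urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=open_url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            # urllib.error.URLError and socket timeouts are both OSError subclasses.
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff * (2 ** attempt))
```

Passing `fetch` and `sleep` as parameters keeps the retry logic testable without touching the network. Retrying does not help against an IP block; that case would likely need a proxy or a slower crawl rate.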
50. Thanks for listening
Contact Info: elliot79313@gmail.com
51. Reference:
1. Pant, Gautam, Padmini Srinivasan, and Filippo Menczer. "Crawling the web." Web Dynamics. Springer Berlin Heidelberg, 2004. 153-177.
2. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: a survey." Knowledge-based systems 70 (2014): 301-323.
3. "Crawling", http://slideplayer.com/slide/7572783/