From Python to Dragon - Continuing from Stage 1 to Build a Crawler

=>
Half hour of code:
Joe @ Taichun.py 2016.01.09

• PyConTW
HoC
•
•
• …

•
• HTTP / HTML / CSS / JS python
•
• DEMO
•

Crawler
•
• JS
•
• JS
•
• JS
•
• JS
•
• BUG
•
•
HoC

STEP 1:
• 1.1
• whois
•
• Python whois module
• online service
• cmd tools
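
A minimal sketch of the whois step, using only the standard library rather than the Python whois module named above. It assumes the plain WHOIS protocol (RFC 3912: send the domain plus CRLF over TCP port 43); the default server `whois.iana.org` and the helper names `query_whois` and `parse_whois` are illustrative choices, not part of the original slides.

```python
import socket


def query_whois(domain, server="whois.iana.org", port=43, timeout=10):
    """Send a raw WHOIS query: the domain followed by CRLF over TCP port 43."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_whois(text):
    """Collect 'key: value' lines into a dict of lists; skip comments and blanks."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("%", "#")) or ":" not in line:
            continue
        key, value = line.split(":", 1)
        record.setdefault(key.strip().lower(), []).append(value.strip())
    return record
```

Typical use would be `parse_whois(query_whois("example.com"))`. Response layouts differ between registries, so the parsed keys should be treated as best effort.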

STEP 1:
• 1.2
• robots.txt, sitemap.xml
• ….
•
• HTTP GET
• Parse with the Python robotparser module (urllib.robotparser in Python 3)
•
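
A short sketch of the robots.txt check with `urllib.robotparser` from Python 3. The rules in `ROBOTS_TXT`, the user agent `my-crawler`, and the `example.com` URLs are made up for illustration; in practice the file would be fetched with an HTTP GET first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, standing in for a fetched file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, user_agent="my-crawler"):
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

With the sample rules, `allowed(build_parser(ROBOTS_TXT), "https://example.com/private/x")` is rejected, while a URL outside `/private/` is allowed. `parser.crawl_delay("my-crawler")` exposes the requested delay between requests.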

STEP 1:
• 1.3
•
•
• Python builtwith module
• Browser

STEP 1:
• 1.4 (optional)
•
•
• google: Kali Linux

STEP 2:
• 2.0
• XD
• HoC
•
• Python requests module
• curl … httpie
• API
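
A minimal sketch of the fetch step. It uses `urllib.request` from the standard library instead of the requests module named above, so that it runs without extra installs. The helper names `build_request` and `fetch` and the user agent string `demo-crawler/0.1` are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(url, params=None, user_agent="demo-crawler/0.1"):
    """Build a GET request with optional query parameters and a User-Agent header."""
    if params:
        url = url + "?" + urlencode(params)
    return Request(url, headers={"User-Agent": user_agent})


def fetch(request, timeout=10):
    """Send the request and return (status code, decoded body)."""
    with urlopen(request, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.status, resp.read().decode(charset, errors="replace")
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before anything goes over the network.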

STEP 3:
•
•
• pattern …
• regular expression
• Python re module, regex101
•
•
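
A small sketch of pattern-based extraction with the `re` module. The two patterns and the helper names are illustrative assumptions; they only cope with simple, well-formed markup and are easy to try out on regex101 first.

```python
import re

# Non-greedy match of the text between <title> and </title>.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# The href value of an <a> tag, quoted with single or double quotes.
HREF_RE = re.compile(r'<a\s+[^>]*href=["\'](.*?)["\']', re.IGNORECASE)


def extract_title(html):
    """Return the page title, or None if no <title> element is found."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def extract_links(html):
    """Return every href value found in <a> tags, in document order."""
    return HREF_RE.findall(html)
```

Regular expressions work well for a fixed pattern on a known page, but they break easily on nested or irregular HTML.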

STEP 3:
•
•
• Parse with BeautifulSoup or lxml
• HoC
• BeautifulSoup parsers: html.parser / lxml / lxml-xml / html5lib
•
• parser
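
A sketch of parser-based extraction. To stay free of third-party installs it uses the standard library `html.parser.HTMLParser` directly, the same parser BeautifulSoup uses for its `html.parser` option, rather than BeautifulSoup itself. The class name `LinkAndTitleParser` and the helper `parse_page` are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collect the page title and all <a href> values while parsing."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def parse_page(html):
    """Return (title, links) for an HTML string."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

Unlike a regular expression, the parser keeps going when a tag such as `<p>` is never closed, which is common on real pages.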

STEP 3:
•
• BJ4
• http -b www.google.com | hxnormalize -x | hxselect -c 'title'

STEP 3:
•
•
• scrapely “train / learn”
• scrapy =>
• scapy =>
•
•

STEP 1:
•
•
• View Source Code vs Element View (chrome)

STEP 2:
•
•
• JavaScript implementation
•
• JS render
• WebView
•
• headless

STEP 2:
• JS render
• WebView
• Python Binding
• PyQt or PySide …
•
• Selenium Python
• headless
• PhantomJS (CasperJS), SlimerJS …

•
• img alt
• OCR
• pytesseract or pytesser
• xx learning + ….
• XD captcha
•

• Python threading / multiprocessing / coroutine modules
• browser automation
• cookies handoff
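
A minimal sketch of the threading option for speeding up a crawl. The function name `crawl_concurrently` is hypothetical, and `fetch` is any callable that takes a URL and returns a result, so a real HTTP fetcher or a stub can be passed in.

```python
import queue
import threading


def crawl_concurrently(urls, fetch, num_workers=4):
    """Run fetch(url) for every URL on a small pool of worker threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, let the thread finish
            try:
                value = fetch(url)
            except Exception as exc:  # keep one bad URL from killing the worker
                value = exc
            with lock:
                results[url] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Threads suit crawling because most of the time is spent waiting on the network. A failed fetch is stored as the exception object, so the caller can decide what to retry.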

BLOCK
@
@
IE ONLY
@
SPIDERTRAP
@
HEADLESS MODE JS EVENT

•
• K-12
• /
• /
• /
•
• GAE (python)
• backbone.js / react.js
• AWS
• SCRUM
