Viller Hsiao
Fetching financial report data with Python
• Practice Python
• Practice writing Python well
• Understand how the web is structured
• Calculate stock value
Steps
• Fetch the web page
• Parse the content
• Compute on the data
Data source
Table type, stock ID
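The two URL parts named on this slide can be sketched as a small helper. This is an illustration only: `build_url` and `BASE_URL` are hypothetical names, and the path layout is inferred from the single URL that appears on the urllib slide (`zch` as the table type, `2330` as the stock ID), so other tables may not follow it.

```python
# Hypothetical helper: the path layout is an assumption inferred from
# the one URL shown in this deck, not a documented scheme.
BASE_URL = "http://jdata.yuanta.com.tw/z/zc"


def build_url(table, stock_id):
    """Return the report URL for one table type and one stock ID."""
    return "{base}/{table}/{table}_{stock_id}.djhtm".format(
        base=BASE_URL, table=table, stock_id=stock_id)


print(build_url("zch", 2330))
```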
Inspect Element
Developer Tools
• Practice the Google Python Style Guide
The Fantastic Voyage of Middle-Aged Py
http://static.ettoday.net/images/206/206484.jpg
Python Modules
• Parse DOM
• urllib + SGMLParser
• requests + BeautifulSoup4
• Excel
• xlutils
urllib
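import urllib  # Python 2 API; Python 3 moved urlopen to urllib.request

url = 'http://jdata.yuanta.com.tw/z/zc/zch/zch_2330.djhtm'
webcode = urllib.urlopen(url)
if webcode.code == 200:
    # keep the raw page body for the parsing step
    self.webpage = webcode.read()
webcode.close()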
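The slide uses the Python 2 `urllib.urlopen` call. For readers on Python 3, the same fetch step might look like the sketch below. `fetch_page` and its `opener` parameter are hypothetical names added for illustration; the parameter exists only so the function can be tried without network access.

```python
import urllib.request


def fetch_page(url, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the status is not 200."""
    response = opener(url)
    try:
        if response.status == 200:
            return response.read()
        return None
    finally:
        # always release the connection, as webcode.close() does on the slide
        response.close()
```

With the default opener, `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the `None` branch is mainly reached with a custom opener.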
SGMLParser
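from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3

class AccountTable(SGMLParser):
    def feed(self, data):         # entry point: receives the raw HTML
    def start_tr(self, attrs):    # called at each <tr>
    def end_tr(self):             # called at each </tr>
    def handle_data(self, data):  # called with the text between tags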
Oops
def start_table(self, attrs):
    # attrs is a list of (name, value) pairs from the <table> tag
    for name, value in attrs:
        if name == 'id' and value == 'oMainTable':
            self.isTargetTbl = True
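Since `sgmllib` is not available in Python 3, the same idea (react to start tags, pick out the table whose `id` is `oMainTable`, collect the cell text) can be sketched with the standard library's `html.parser.HTMLParser`. This is not the author's code: the attributes `depth`, `rows`, `row`, and `in_cell` are illustrative, and the sketch assumes the target table does not contain nested tables with rows of their own.

```python
from html.parser import HTMLParser


class AccountTable(HTMLParser):
    """Collect the cell text of every row inside <table id="oMainTable">."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # table nesting depth, counted from the target table
        self.rows = []        # finished rows, each a list of cell strings
        self.row = None       # row currently being filled, if any
        self.in_cell = False  # True while inside a <td> or <th> of the target

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth or ("id", "oMainTable") in attrs:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.row = []
        elif self.row is not None and tag in ("td", "th"):
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and self.row:
            self.row[-1] += data.strip()


html = ('<table id="other"><tr><td>skip</td></tr></table>'
        '<table id="oMainTable"><tr><td class="t3n0">104/05</td>'
        '<td class="t3n1">70,154,763</td></tr></table>')
parser = AccountTable()
parser.feed(html)
print(parser.rows)
```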
Chinese text transcoding
line.decode('big5').encode('utf8')  # Big5 bytes -> UTF-8 bytes
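As a small Python 3 illustration of the transcoding step, the sketch below decodes Big5 bytes to text and re-encodes them as UTF-8. It assumes the source page is served in Big5, as the slide implies; `big5_to_utf8` is a hypothetical helper name and the sample string stands in for bytes read from the page.

```python
def big5_to_utf8(raw):
    """Re-encode Big5 bytes as UTF-8 bytes."""
    return raw.decode("big5").encode("utf-8")


# Stand-in for bytes read from the page.
sample = "財報".encode("big5")
converted = big5_to_utf8(sample)
print(converted.decode("utf-8"))
```

In Python 3 it is usually enough to decode once and keep working with `str`; the re-encode matters only when the result is written out as bytes.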
v2.0
• Coding style refinement
• Google Python Style Guide
• Python idioms
g0v project
requests
import requests

def parse_url(url):
    r = requests.get(url)
    if r.status_code == requests.codes.ok:
        parse_html(r.text)
BeautifulSoup
from bs4 import BeautifulSoup

def parse_html(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    # class is a Python keyword, so BeautifulSoup takes class_ instead
    table = soup.find('table', class_='t01')
    rows = table.find_all('tr')
    data = []
    for row in rows:
        cols = row.find_all('td')
        cols = [e.text.encode('utf-8').strip() for e in cols]
        data.append(cols)
    return data
<td class="t3n0">104/05</td><td class="t3n1">70,154,763</td>
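Once cells like the two above are extracted, the computation step needs numbers rather than strings. The sketch below is one possible conversion, under two assumptions that the deck does not state: that `104/05` is a year/month in the ROC calendar (ROC year + 1911 = Gregorian year) and that amounts use commas as thousands separators. `parse_period` and `parse_amount` are hypothetical names.

```python
def parse_period(text):
    """Convert 'YYY/MM' in the ROC calendar to (gregorian_year, month)."""
    year, month = text.split("/")
    return int(year) + 1911, int(month)


def parse_amount(text):
    """Convert a comma-grouped figure such as '70,154,763' to an int."""
    return int(text.replace(",", ""))


print(parse_period("104/05"), parse_amount("70,154,763"))
```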
Future Plan
• concurrent / gevent
• fake browser header
• free proxy
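Two of the planned items can be sketched with the standard library alone: a thread pool from `concurrent.futures` for concurrent fetching, and a browser-like `User-Agent` header on each request. This is a rough illustration rather than the author's plan: the header value, the function names, and the worker count are all assumptions, and gevent and proxies are not covered.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Hypothetical header value; any browser-like string could be used here.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url):
    """Download one page and return its body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=4):
    """Fetch several URLs in parallel threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Passing a different `fetch_one` makes it possible to try the pooling logic without touching the network.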

My first-crawler-in-python