17. Reading web page information (pkg xml2)
library(xml2)
# set your target url
doc <- read_html(url)
# set the xpath of info needed
xpath <- "//*[@id='inquiry3']/table//tr[4]/td[1]"
xml_text(xml_find_all(doc, xpath))
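A minimal, self-contained sketch of the same extraction. The inline HTML string is a hypothetical stand-in for the real page (the slide does not give the target URL); only the xpath is taken from the slide.

```r
library(xml2)

# Hypothetical page content; in practice this would come from read_html(url).
html <- '<html><body><div id="inquiry3"><table>
  <tr><td>row 1</td></tr>
  <tr><td>row 2</td></tr>
  <tr><td>row 3</td></tr>
  <tr><td>target cell</td><td>other cell</td></tr>
</table></div></body></html>'

doc <- read_html(html)

# Same xpath as on the slide: first cell of the fourth table row
# under the element whose id is "inquiry3".
xpath <- "//*[@id='inquiry3']/table//tr[4]/td[1]"
cell_text <- xml_text(xml_find_all(doc, xpath))
print(cell_text)
```

The xpath uses `table//tr` rather than `table/tr`, so it still matches when the rows are wrapped in a `tbody` element.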
Data collection - Lecture A-01
47. Summary Functions in R
Function | Description | In plain terms
names() | Functions to get or set the names of an object | View the column names
head(), tail() | Returns the first or last parts of a vector, matrix, table, data frame or function | View the first (or last) few rows
str() | Compactly display the internal structure of an R object | Object structure and attributes
summary() | Produce result summaries | Basic numeric summary of an object
dim() | Retrieve or set the dimension of an object | Matrix (or data frame) size
length() | Get or set the length of vectors | Vector length
complete.cases() | Return a logical vector indicating which cases are complete, i.e., have no missing values | Logical vector: TRUE for each row with no NA
as.Date() | Convert between character representations and objects of class "Date" representing calendar dates | Convert to Date type
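The functions in the table can be tried on a small data frame. The sketch below uses a hypothetical toy data frame (`df`, with made-up `date` and `amount` columns), not the course data set.

```r
# Hypothetical toy data with one missing value in each column.
df <- data.frame(
  date   = c("2016-01-05", "2016-02-10", NA),
  amount = c(100, NA, 300),
  stringsAsFactors = FALSE
)

names(df)            # column names
head(df, 2)          # first two rows
str(df)              # compact structure of the object
summary(df$amount)   # basic numeric summary, including the NA count
dim(df)              # number of rows and columns
length(df$amount)    # length of one column vector

ok <- complete.cases(df)      # TRUE only for rows without any NA
df$date <- as.Date(df$date)   # character -> Date
```

Only the first row is complete here, so `df[ok, ]` would keep a single row.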
Abbreviations in function and parameter names, explained:
http://jeromyanglim.blogspot.tw/2010/05/abbreviations-of-r-commands-explained.html
48. Visualization Functions in R
Function | Description | In plain terms
plot() | Generic function for plotting of R objects | Draw a plot (scatter plot, or a plot of an R object)
boxplot() | Produce box-and-whisker plot(s) of the given (grouped) values | Box-and-whisker plot
hist() | Computes a histogram of the given data values | Histogram (distribution plot)
barplot() | Creates a bar plot with vertical or horizontal bars | Bar chart
arrows() | Draw arrows between pairs of points | Add arrows (x0, y0, x1, y1)
abline() | a, b: the intercept and slope, single values; y = a + bx | Add a straight line with intercept a and slope b
lines() | Join the corresponding points with line segments | Line chart
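A short sketch that calls each plotting function from the table once. The data are simulated and hypothetical (`x`, `y`, `group` are illustration names only).

```r
# Draw to a null device so the script runs anywhere;
# drop this line and dev.off() to see the plots in an interactive session.
pdf(NULL)

set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)            # simulated, roughly linear data
group <- ifelse(x > 10, "high", "low")  # two groups of ten points

plot(x, y)                               # scatter plot
abline(a = 2, b = 0.5, col = "red")      # line with intercept 2 and slope 0.5
lines(x, y)                              # join the points with line segments
arrows(x0 = 5, y0 = 10, x1 = 10, y1 = 7, length = 0.1)  # arrow from (5,10) to (10,7)

boxplot(y ~ group)                       # one box per group
h <- hist(y)                             # histogram; bin counts are returned
counts <- table(group)
barplot(counts)                          # bar chart of group sizes

invisible(dev.off())
```

`abline()`, `lines()` and `arrows()` add to the most recent plot, so they are called right after `plot()`.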
49. session_B_eda.R
Read in the data and take a look at the variables
# load in the Apple Daily articles
> d <- read.csv("df_article.csv", fileEncoding = "utf-8")
# use dim() to know data frame dimension
> dim(d)
[1] 3779 12
# check the column names
> names(d)
[1] "aid" "case.closed" "date.funded" "date.published"
[5] "donation" "donor" "journalist" "n.image" "n.word"
[10] "title" "url.article" "url.detail"
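Since df_article.csv is not included here, the same steps can be tried on a small stand-in file. The sketch below writes a hypothetical two-row CSV (the values are made up; only three of the column names above are reused) to a temporary file and inspects it the same way.

```r
# Hypothetical stand-in for df_article.csv with made-up values.
tmp <- tempfile(fileext = ".csv")
writeLines(c("aid,donor,donation",
             "A0001,10,5000",
             "A0002,25,12000"), tmp)

d_small <- read.csv(tmp, fileEncoding = "UTF-8")

dim(d_small)     # number of rows and columns
names(d_small)   # column names
str(d_small)     # column types at a glance

unlink(tmp)      # remove the temporary file
```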
EDA - Lecture B-02
62. plot()
# use plot() to check the relationship between the number of donors
# and the total donation, then draw the fitted regression line
> plot(d$donor, d$donation, pch = '.', cex = 2)
> y <- lm(donation ~ donor, data = d)
> abline(y, col = 'red', lwd = 1.5)
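The same plot-then-fit pattern can be run without the article data. This sketch simulates a hypothetical data frame `d_sim` whose true slope is 500 (an arbitrary choice for illustration) and checks that `lm()` recovers it before drawing the line with `abline()`.

```r
set.seed(42)

# Simulated, hypothetical data: donation grows linearly with donor count.
donor    <- rpois(200, lambda = 30)
donation <- 500 * donor + rnorm(200, sd = 500)
d_sim    <- data.frame(donor, donation)

fit <- lm(donation ~ donor, data = d_sim)
coef(fit)   # intercept and slope estimated by least squares

pdf(NULL)   # null device; drop this and dev.off() to view the plot
plot(d_sim$donor, d_sim$donation, pch = ".", cex = 2)
abline(fit, col = "red", lwd = 1.5)   # abline() accepts the lm object directly
invisible(dev.off())
```

Passing the fitted model to `abline()` draws y = a + bx using the estimated intercept and slope, so the coefficients do not need to be extracted by hand.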
109. Skip-gram model
Predict the context words from the current word
Neural network model
Stochastic gradient descent (SGD)
|V|: the number of distinct words in the vocabulary
d: the dimension of the word vectors
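[Figure: skip-gram network. The one-hot vector (1 x |V|) of the current word w_t (最熱的, "hottest") is multiplied by W (|V| x d) to give a 1 x d hidden vector. C (d x |V|) maps the hidden vector to 1 x |V| output vectors of probabilities, which are compared with the one-hot vectors of the context words w_{t-1} (夏天, "summer") and w_{t+1} (季節, "season").]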
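The forward pass and one SGD update of the skip-gram model can be sketched with plain matrices. This is a toy illustration under stated assumptions, not a training implementation: the three-word vocabulary uses English glosses of the slide's example words, d = 2 and the learning rate are arbitrary, and a full softmax with cross-entropy loss is assumed.

```r
set.seed(1)
vocab <- c("summer", "hottest", "season")  # glosses of the example words
V <- length(vocab)
d <- 2                                     # hypothetical embedding dimension

W <- matrix(rnorm(V * d, sd = 0.1), V, d, dimnames = list(vocab, NULL))  # |V| x d
C <- matrix(rnorm(d * V, sd = 0.1), d, V, dimnames = list(NULL, vocab))  # d x |V|

one_hot <- function(word) {
  x <- matrix(0, nrow = 1, ncol = V, dimnames = list(NULL, vocab))
  x[1, word] <- 1
  x
}
softmax <- function(z) {
  e <- exp(z - max(z))   # subtract the max for numerical stability
  e / sum(e)
}
forward <- function(word) {
  h <- one_hot(word) %*% W            # 1 x d hidden vector
  list(h = h, p = softmax(h %*% C))   # 1 x |V| probabilities
}

# One SGD step for the pair (current = "hottest", context = "summer").
before <- forward("hottest")
err    <- before$p - one_hot("summer")   # gradient of the loss w.r.t. the scores
grad_C <- t(before$h) %*% err            # d x |V|
grad_h <- err %*% t(C)                   # 1 x d, computed before C is updated

lr <- 0.5
C <- C - lr * grad_C
W["hottest", ] <- W["hottest", ] - lr * as.vector(grad_h)

after <- forward("hottest")
c(before = before$p[1, "summer"], after = after$p[1, "summer"])
```

Only the row of W belonging to the current word is updated, because the one-hot input selects that single row.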
110. Continuous Bag of Words model
Predict the current word from the context words
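[Figure: CBOW network. The one-hot vectors of the context words w_{t-1} (夏天, "summer") and w_{t+1} (季節, "season") are each multiplied by C (|V| x d) and combined into a 1 x d hidden vector. W (d x |V|) maps the hidden vector to a 1 x |V| output vector of probabilities, which is compared with the one-hot vector of the current word w_t (最熱的, "hottest").]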
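A matching sketch of the CBOW forward pass, using the matrix shapes labelled in the figure (C is |V| x d, W is d x |V|). It assumes the context vectors are averaged, as in the usual CBOW formulation; the slide itself does not say how they are combined. The vocabulary and d = 2 are hypothetical.

```r
set.seed(2)
vocab <- c("summer", "hottest", "season")  # glosses of the example words
V <- length(vocab)
d <- 2                                     # hypothetical embedding dimension

C <- matrix(rnorm(V * d, sd = 0.1), V, d, dimnames = list(vocab, NULL))  # |V| x d
W <- matrix(rnorm(d * V, sd = 0.1), d, V, dimnames = list(NULL, vocab))  # d x |V|

softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

context <- c("summer", "season")   # w_{t-1} and w_{t+1}

# Selecting rows of C is the same as multiplying one-hot vectors by C.
# Assumption: the context vectors are averaged into one hidden vector.
h <- matrix(colMeans(C[context, , drop = FALSE]), nrow = 1)   # 1 x d

p    <- softmax(h %*% W)         # 1 x |V| probabilities for the current word
loss <- -log(p[1, "hottest"])    # cross-entropy against the true current word
```

Compared with skip-gram, the roles are reversed: several context words go in and one distribution over the current word comes out.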
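111. Word vectors
After training, W (|V| x d) is the matrix that converts a 1 x |V| word representation into a d-dimensional vector.
[Figure: the matrix W with rows 1, 2, ..., i, ..., |V| and columns 1, 2, ..., (d-2), (d-1), d; row i is the word vector of the i-th word.]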
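The lookup can be checked numerically: multiplying a one-hot 1 x |V| vector by W returns exactly row i of W. The matrix below is random and hypothetical (d = 4 and the vocabulary are illustration values), standing in for trained weights.

```r
set.seed(3)
vocab <- c("summer", "hottest", "season")
d <- 4   # hypothetical embedding dimension

# Random stand-in for a trained |V| x d matrix.
W <- matrix(rnorm(length(vocab) * d), nrow = length(vocab), ncol = d,
            dimnames = list(vocab, NULL))

i <- 2                               # index of "hottest" in the vocabulary
x <- matrix(0, nrow = 1, ncol = length(vocab))
x[1, i] <- 1                         # one-hot 1 x |V| vector

vec_by_product <- as.vector(x %*% W) # matrix product picks out row i
vec_by_lookup  <- unname(W[i, ])     # direct row lookup

all.equal(vec_by_product, vec_by_lookup)
```

In practice the row lookup is used, since it avoids multiplying by a vector that is almost entirely zeros.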
112. A brief look at text2vec
Global corpus statistics + local window
Count-based method
Term co-occurrence matrix, X
X_ij: the number of times word i and word j occur together
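[Figure: Term Co-occurrence Matrix X (|V| x |V|; rows 1, 2, ..., i, ..., |V|; columns 1, ..., j, ..., k, ..., |V|). For the example with current word w_t = 最熱的 ("hottest") and context words w_{t-1} = 夏天 ("summer") and w_{t+1} = 季節 ("season"), the entries in the row of 最熱的 under the columns of 夏天 and 季節 are each incremented by 1.]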
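The counting step can be sketched in base R. This is not the text2vec API, only a hand-written illustration of how X is filled; it assumes a symmetric window of one word on each side and unweighted counts, and it uses English glosses of the example words.

```r
tokens <- c("summer", "hottest", "season")  # glosses of the example sentence
vocab  <- unique(tokens)
window <- 1                                 # assumed: one word on each side

# X[i, j] counts how often word j appears within the window around word i.
X <- matrix(0, nrow = length(vocab), ncol = length(vocab),
            dimnames = list(vocab, vocab))

for (pos in seq_along(tokens)) {
  lo <- max(1, pos - window)
  hi <- min(length(tokens), pos + window)
  for (ctx in setdiff(lo:hi, pos)) {        # every window position except the word itself
    X[tokens[pos], tokens[ctx]] <- X[tokens[pos], tokens[ctx]] + 1
  }
}

X
```

With "hottest" as the current word, the cells for "summer" and "season" each receive +1, as in the figure, while "summer" and "season" never fall in the same window and stay at 0.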