Practical data analysis with wine

December 2014
Toshifumi Kuga CEO of TOSHI STATS SDN. BHD.
beta version
1

Today’s menu
1. formula for prediction of wine price
2. data handling (vector & matrix)
3. linear regression model with R
2

The formula for prediction of wine price is public
• Dr. Orley Ashenfelter
• He is a professor of economics at Princeton University and was president of the American Economic Association in 2011
• The formula was made public in 1990
3
1. formula of price prediction
Data is available here: http://www.liquidasset.com/winedata.html

Dr. Orley Ashenfelter’s formula
wine price = -12.145 + 0.00117 × amount of rain in winter + 0.6164 × average temperature - 0.00386 × amount of rain in harvest + 0.02385 × years from 1983
• parameters: θ = [-12.145, 0.00117, 0.6164, -0.00386, 0.02385]
• input variables: X = [1, rain winter, average temp, rain harvest, years]
• wine price: Y = θ0 + θ1×X1 + θ2×X2 + θ3×X3 + θ4×X4
• wine price can be represented as "Y=θX"
※ 'wine price' : log of the ratio of the average price of the year to the average price of 1961 (simplified in the explanation above)
4
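The representation Y=θX can be checked with a few lines of R. This is only a minimal sketch: the inputs are the 1952 row of the data table in these slides, the temperature coefficient is taken as 0.6164 (the DEGREES value 0.616365 from the lm output in these slides, rounded), and the names theta, x and y are chosen here for illustration.

```r
# Parameters: intercept, winter rain, average temperature, harvest rain, years
# (temperature coefficient rounded from the lm output, an assumption of this sketch)
theta <- c(-12.145, 0.00117, 0.6164, -0.00386, 0.02385)

# Inputs for the 1952 vintage; the leading 1 multiplies the intercept
x <- c(1, 600, 17.1167, 160, 31)

# Y = theta0 + theta1*X1 + theta2*X2 + theta3*X3 + theta4*X4
y <- sum(theta * x)
y
```

The result is about -0.77, while the observed LPRICE2 for 1952 in the data table is -0.99868.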

Steps for prediction of wine price
• wine price: Y=θX
• Y : value to be predicted (future wine price in this case; unknown value)
• X : known values (e.g., temperatures in the past are known now)
• Parameters θ are unknown
    → If θ is obtained, future wine price Y can be obtained, too!
• Y in the past is also known (wine prices in the past are known)
    → X and Y in the past are available as a set → θ can be obtained
5

Data used in the analysis
OBS  VINT  Y:LPRICE2  X1:WRAIN  X2:DEGREES  X3:HRAIN  X4:TIME_SV
1    1952  -0.99868   600       17.1167     160       31
2    1953  -0.4544    690       16.7333     80        30
3    1954  NA         430       15.3833     180       29
4    1955  -0.80796   502       17.15       130       28
5    1956  NA         440       15.65       140       27
…    …     …          …         …           …         …
35   1986  NA         563       16.2833     171       -3
36   1987  NA         452       16.9833     115       -4
37   1988  NA         808       17.1        59        -5
38   1989  NA         443       NA          82        -6
(Y: LPRICE2; X: WRAIN, DEGREES, HRAIN, TIME_SV; NA marks a cell that is blank on the slide)
6

How to obtain θ: least squares method
• By comparing predictions with observed values (values in the past), parameters θ are obtained so that the sum of squared differences is minimized
• There are programs (algorithms) that execute these calculations automatically on a computer
• In practice, we rarely calculate parameters manually (for realistic amounts of data, it cannot be solved by hand)
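The idea can be sketched in R as follows. The data are toy numbers made up for illustration (not the wine data), the function name sse is hypothetical, and optim is used here only as one example of a general-purpose minimizer.

```r
# Toy data, made up for illustration (not the wine data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared differences between predictions and observed values
# for a candidate parameter vector theta = (intercept, slope)
sse <- function(theta) {
  prediction <- theta[1] + theta[2] * x
  sum((prediction - y)^2)
}

sse(c(1, 1))   # poor candidate: large squared error
sse(c(0, 2))   # better candidate: small squared error

# Let the computer search for the parameters that minimize the squared error
best <- optim(c(0, 0), sse)$par
best
```

The search should end near an intercept of 0.05 and a slope of 1.99, which is the exact least squares solution for these five points.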
7

　Parameter calculations by computer
8
[Diagram: values in the past (X, Y) → parameter calculation → θ → price prediction model Y=θX]

θ and X are not "just a number"
• Each is a collection of numbers
• They can be represented as vectors and matrices in math
• Massive amounts of data can be represented by vectors and matrices with ease!
• Data can be handled as vectors and matrices in computers
• Major programming languages such as R, MATLAB, and Python can create vectors and matrices and handle them effectively
9

Be familiar with vectors and matrices！
• You can handle data as you like
• You can program it by yourself
• First step for practical data analysis
10

2. Data handling (vector & matrix)
Math in high school is important!
• Mainly arithmetic is explained
• No more than +, -, ×, /
• Practice by hand until you are familiar with vectors and matrices
• Let us verify the results by using R
11

　　 Vector : one line
• either vertical or horizontal
Examples on the slide: [1 3 7], [5 13], and a vertical vector with elements 1, 5
a=c(1,3,7)
b=c(5,13)
d=c(1,5)
R code (blue on the slides): verify the results with the R language
12

　　　vector : addition
[1 3 7] + [2 4 1] = [3 7 8]
a=c(1,3,7)
b=c(2,4,1)
a+b
Vertical vectors:
[1]   [6]   [7]
[5] + [2] = [7]
a=c(1,5)
b=c(6,2)
a+b
13

　　　vector : subtraction
[1 4 7] - [2 3 1] = [-1 1 6]
a=c(1,4,7)
b=c(2,3,1)
a-b
Vertical vectors:
[1]   [6]   [-5]
[5] - [2] = [ 3]
a=c(1,5)
b=c(6,2)
a-b
14

　　　vector : scalar multiplication
3 × [2 4 1] = [6 12 3]
a=c(2,4,1)
3*a
Vertical vector:
    [6]   [12]
2 × [2] = [ 4]
b=c(6,2)
2*b
15

　vector : multiplication (inner product)　
          [3]
[2 4 1] × [6] = 2×3 + 4×6 + 1×2 = 32
          [2]
a=c(2,4,1)
b=c(3,6,2)
a%*%b
16

　　　Matrix : rectangular shape
a=matrix(c(1,3,2,4),2,2)
• dimension: number of rows × number of columns (m×n)
• dimensions of the examples on the slide: 2×2, 2×2, 3×2
17

　　　Matrix : elements
• elements (entries) of the matrix
[1 2]
[3 4]
first row, first column: 1
second row, first column: 3
first row, second column: 2
second row, second column: 4
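The same elements can be read off in R. A minimal sketch using the matrix a from these slides; indexing is written as a[row, column].

```r
# The 2x2 matrix from the slides; matrix() fills the values column by column
a <- matrix(c(1, 3, 2, 4), 2, 2)

dim(a)    # number of rows, number of columns
a[1, 1]   # first row, first column
a[2, 1]   # second row, first column
a[1, 2]   # first row, second column
a[2, 2]   # second row, second column
```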
18

　　　Matrix : addition
[Figure: worked examples of matrix addition]
19
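Matrix addition can be verified in R in the same way as vector addition. This sketch assumes the matrices a and b that appear elsewhere in these slides, since the numbers of the worked example on this slide are not available; the name s is chosen for illustration.

```r
a <- matrix(c(1, 3, 2, 4), 2, 2)   # rows: (1 2) and (3 4)
b <- matrix(c(2, 5, 9, 4), 2, 2)   # rows: (2 9) and (5 4)

# Addition works element by element, so both matrices need the same dimension
s <- a + b
s
```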

　　　Matrix : subtraction
[Figure: worked examples of matrix subtraction]
20
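Subtraction follows the same element-by-element rule. A minimal sketch, again assuming the matrices a and b used elsewhere in these slides; the name d is chosen for illustration.

```r
a <- matrix(c(1, 3, 2, 4), 2, 2)   # rows: (1 2) and (3 4)
b <- matrix(c(2, 5, 9, 4), 2, 2)   # rows: (2 9) and (5 4)

# Each element of b is subtracted from the element of a in the same position
d <- a - b
d
```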

　　Matrix:scalar multiplication/division
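[Figure: worked examples of multiplying a matrix by 2 and dividing a matrix by 2; dividing by 2 is the same as multiplying by 1/2]
21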
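Scalar multiplication and division can be checked in R as below. The matrix a is the one used elsewhere in these slides; the names doubled and halved are chosen for illustration.

```r
a <- matrix(c(1, 3, 2, 4), 2, 2)   # rows: (1 2) and (3 4)

doubled <- 2 * a    # every element is multiplied by 2
halved  <- a / 2    # every element is divided by 2

doubled
halved
(1 / 2) * a         # same result as a / 2
```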

　　　Matrix : multiplication
[Figure: worked examples of matrix multiplication]
A little complicated?
22

Let us see it in more detail!
[1 2]   [2 9]   [12 17]
[3 4] × [5 4] = [26 43]
1×2 + 2×5 = 12    1×9 + 2×4 = 17
3×2 + 4×5 = 26    3×9 + 4×4 = 43
a=matrix(c(1,3,2,4),2,2)
b=matrix(c(2,5,9,4),2,2)
a%*%b
23

　Matrix multiplication : not commutative
[Figure: worked example in which A × B and B × A give different results]
24
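That the order matters can be verified in R. A minimal sketch with the matrices a and b from these slides; the names ab and ba are chosen for illustration.

```r
a <- matrix(c(1, 3, 2, 4), 2, 2)   # rows: (1 2) and (3 4)
b <- matrix(c(2, 5, 9, 4), 2, 2)   # rows: (2 9) and (5 4)

ab <- a %*% b   # rows: (12 17) and (26 43)
ba <- b %*% a   # rows: (29 40) and (17 26)

ab
ba
identical(ab, ba)   # FALSE: the two products differ
```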

　　　vector : multiplication　2
[3]           [ 6 12]
[6] × [2 4] = [12 24]
a=matrix(c(3,6),2,1)
b=c(2,4)
a%*%b
25

identity matrix
• Diagonal elements are 1
• All other elements are 0
• Multiplying by the identity matrix changes nothing: A × I = I × A = A
diag(2)
26
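The "nothing is changed" property can be checked directly in R. A minimal sketch using the matrix a from these slides; the name i2 is chosen for illustration.

```r
a  <- matrix(c(1, 3, 2, 4), 2, 2)   # rows: (1 2) and (3 4)
i2 <- diag(2)                       # 2x2 identity matrix

a %*% i2    # same as a
i2 %*% a    # same as a
```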

inverse matrix
• If A is an m×m matrix and has an inverse matrix A⁻¹, then A A⁻¹ = A⁻¹ A = I (I : identity matrix)
a=matrix(c(1,3,2,4),2,2)
> a
     [,1] [,2]
[1,]    1    2
[2,]    3    4
> inv=solve(a)
> inv
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5
27
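The defining property of the inverse can be confirmed in R. This is a minimal sketch; the second matrix, named singular here, is a made-up example of a square matrix without an inverse (its determinant is 0).

```r
a   <- matrix(c(1, 3, 2, 4), 2, 2)   # rows: (1 2) and (3 4)
inv <- solve(a)                      # inverse of a

# Both products give the identity matrix (up to rounding error)
a %*% inv
inv %*% a

# Not every square matrix has an inverse: this one has determinant 0
singular <- matrix(c(1, 2, 2, 4), 2, 2)   # rows: (1 2) and (2 4)
det(singular)
```

Calling solve on a matrix like singular stops with an error, which is how R signals that no inverse exists.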

transpose matrix
• exchange rows and columns: the element in row i, column j moves to row j, column i
[1 2]      [1 3]
[3 4]  →   [2 4]
a=matrix(c(1,3,2,4),2,2)
t(a)
28

Least squares estimation
• Vectors and matrices are used in programming least squares estimation
• J = 1/(2*m) * T(X*θ-Y) * (X*θ-Y) : cost function (squared error function)
• m : number of sample data
• X is a matrix, Y is a vector, θ is a parameter vector
• T() means transpose
• θ is obtained so that J is minimized (the difference between predictions and real values is minimized)
→ Least squares estimation
29
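The cost function translates almost literally into R with the matrix operations introduced earlier. This is a sketch under stated assumptions: the data are toy numbers made up for illustration, and the minimizing θ is computed with the normal equations T(X) X θ = T(X) Y, a standard closed-form solution that these slides do not derive.

```r
# Toy data, made up for illustration: 5 samples, one input variable
x1 <- c(1, 2, 3, 4, 5)
Y  <- c(2.1, 3.9, 6.2, 7.8, 10.1)

X <- cbind(1, x1)   # column of 1s for the intercept, then the input
m <- length(Y)      # number of sample data

# Cost function J = 1/(2*m) * T(X*theta - Y) * (X*theta - Y)
J <- function(theta) {
  r <- X %*% theta - Y
  as.numeric(t(r) %*% r) / (2 * m)
}

# Normal equations: solve (T(X) X) theta = T(X) Y for theta
theta_hat <- solve(t(X) %*% X, t(X) %*% Y)

theta_hat
J(theta_hat)   # value of J at the least squares solution
J(c(0, 2))     # a different theta gives a larger J

# lm should arrive at the same parameters
coef(lm(Y ~ x1))
```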

3. Linear regression with R
analysis by linear regression model "lm"
> wineprice=lm(LPRICE2~WRAIN+DEGREES+HRAIN+TIME_SV, data=wine)
> wineprice
After lm, put the variable to be predicted, then "~" and the input variables, then data=name of the data
> ▼▼▼=lm(◯◯◯~△△△+■■■, data=◎◎◎)
> ▼▼▼
◯◯◯ : a variable to be predicted
△△△, ■■■ : input variables
Data is available here: http://www.liquidasset.com/winedata.html
30

　　　Parameters can be obtained！
• Call:
• lm(formula = LPRICE2 ~ WRAIN + DEGREES + HRAIN + TIME_SV, data = wine)
• Coefficients:
• (Intercept)  WRAIN     DEGREES   HRAIN      TIME_SV
• -12.145007   0.001167  0.616365  -0.003861  0.023850
Let us compare them with the formula for prediction of wine price
31

32
RStudio
see p38

33
□　prediction
◯　real price
predict(wineprice,data.frame(wine))
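The fit-then-predict workflow can be tried without downloading anything. The sketch below does NOT use the real wine data: it generates synthetic data from the coefficients printed in these slides plus a little random noise, then checks that lm recovers them and uses predict on new inputs. The names wine_sim, fit and new_vintage, the input ranges, and the noise level are all assumptions made for illustration.

```r
set.seed(1)   # make the synthetic data reproducible

# Synthetic data, NOT the real wine data: random inputs in plausible ranges,
# prices generated from the coefficients shown in the slides
n <- 30
wine_sim <- data.frame(
  WRAIN   = runif(n, 400, 800),
  DEGREES = runif(n, 15, 18),
  HRAIN   = runif(n, 50, 250),
  TIME_SV = n:1
)
theta <- c(-12.145007, 0.001167, 0.616365, -0.003861, 0.023850)
wine_sim$LPRICE2 <- theta[1] +
  theta[2] * wine_sim$WRAIN + theta[3] * wine_sim$DEGREES +
  theta[4] * wine_sim$HRAIN + theta[5] * wine_sim$TIME_SV +
  rnorm(n, sd = 0.01)   # small random noise

# Fit the model and look at the estimated parameters
fit <- lm(LPRICE2 ~ WRAIN + DEGREES + HRAIN + TIME_SV, data = wine_sim)
coef(fit)

# Predict the price for a new set of inputs (values chosen for illustration)
new_vintage <- data.frame(WRAIN = 600, DEGREES = 17, HRAIN = 100, TIME_SV = 5)
predict(fit, new_vintage)
```

With the real data loaded into a data frame named wine, the same two calls (lm, then predict) apply unchanged.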

analyze data by functions automatically
• With the function 'lm', parameters can be obtained with a one-line command
• There are a lot of functions in R. We can analyze data with these functions without writing functions by ourselves.
• However, we should broadly understand how calculations are done inside the functions. A black-box approach is not recommended.
• The better we understand functions, the better we can select the right function for a particular problem
• Let us become familiar with 'lm'. Then you can understand other functions with ease
34

recommender systems
• Amazon.com and Netflix are famous for recommendations
• There are a variety of recommendations
• Recommend the most popular product
→ same recommendation for everyone
• Recommend the best products for the individual customer
→ need for a personalization method!
35

Personalization
• example of a method for personalized recommendations
• θ: customer's preferences (did they click the product or not? did they provide a rating or not?)
• X: item features (in the case of movies: horror? romance? SF? Who is the director, actor, actress? When and where was it created?)
• Obtain probabilities based on θX with a logistic regression model
• If the probability is high, the item is recommended to the customer
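The last two steps can be sketched in R. All numbers below are hypothetical and chosen only for illustration, as are the names sigmoid, theta, x, p and recommend and the 0.5 threshold; in practice θ would be estimated from the customer's past behavior.

```r
# Logistic (sigmoid) function: maps any real number to a value between 0 and 1
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical values for illustration only
theta <- c(-1.0, 2.0, 0.5, -1.5)   # intercept, then preference for horror, romance, SF
x     <- c(1, 1, 0, 0)             # leading 1 for the intercept; this movie is horror only

# Probability that the customer likes the item, based on theta X
p <- sigmoid(sum(theta * x))
p

# Recommend the item when the probability is high (threshold 0.5 is an assumption)
recommend <- p > 0.5
recommend
```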
36

Quandl：data source
• Over 10M datasets are available for free
• Data can be downloaded directly to R, MATLAB, Python
https://www.quandl.com
37

Website of R and RStudio
• R is a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org
• I have prepared a short video about how to use R: http://www.toshistats.net/introduction-to-r-language/
• RStudio is one of the best IDEs for R: http://www.rstudio.com/products/rstudio/download/
38

Thank you for your attention
• TOSHI STATS SDN. BHD., Digital-learning center for statistical computing in Asia
• CEO : Toshifumi Kuga, Certified financial services auditor
• Company website : www.toshistats.net
• Company FB page : www.facebook.com/toshistatsco
• Company blog : http://toshistats.wordpress.com/aboutme/
• The company blog is updated at 10:00 AM every Thursday and reports the latest information about data analysis! Please look at this blog or the company website.
39

Disclaimer
• TOSHI STATS SDN. BHD. and I do not accept any responsibility or
liability for loss or damage occasioned to any person or property
through using materials, instructions, methods, algorithm or ideas
contained herein, or acting or refraining from acting as a result of
such use. TOSHI STATS SDN. BHD. and I expressly disclaim all
implied warranties, including merchantability or fitness for any
particular purpose. There will be no duty on TOSHI STATS SDN.
BHD. and me to correct any errors or defects in the codes and the
software
© 2014 TOSHI STATS SDN. BHD. All rights reserved
40
