Practical data analysis with wine
1. Practical data analysis with wine
December 2014
Toshifumi Kuga CEO of TOSHI STATS SDN. BHD.
beta version
2. Today’s menu
1. formula for prediction of wine price
2. data handling (vector & matrix)
3. linear regression model with R
3. Formula for prediction of wine price is public
• Dr. Orley Ashenfelter
• He is a professor of economics at Princeton University and was president of the American Economic Association in 2011
• The formula was made public in 1990
1. formula of price prediction
Data is available here: http://www.liquidasset.com/winedata.html
4. Dr. Orley Ashenfelter’s formula
wine price = -12.145 + 0.00117 × amount of rain in winter + 0.6164 × average temperature - 0.00386 × amount of rain in harvest + 0.02385 × years from 1983
• parameters:θ=[ -12.145, 0.00117, 0.6164, -0.00386, 0.02385 ]
• input variables:X=[1, rain winter, average temp, rain harvest, years]
• wine price:Y=θ0+θ1×X1+θ2×X2+θ3×X3+θ4×X4
• wine price can be represented as “Y=θX”
※ ‘wine price’ : the logarithm of the ratio of the average price of the vintage year to the average price of the 1961 vintage (simplified in the explanation above)
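The relation Y=θX can be checked numerically. The following is a minimal sketch in R. The parameter values are the coefficients that lm reports later in this deck. The input values in x are hypothetical numbers chosen only for illustration and are not taken from the data set, and the natural logarithm is assumed for the price ratio.

```r
# Parameters theta: coefficients reported by lm() in this deck
theta <- c(-12.145007, 0.001167, 0.616365, -0.003861, 0.023850)

# Input vector X (hypothetical values, for illustration only):
# 1 for the intercept, rain in winter, average temperature,
# rain in harvest, years
x <- c(1, 600, 17.12, 160, 31)

# Y = theta X : sum of the element-wise products
y <- sum(theta * x)

# Y is a log ratio, so exp(Y) is the price relative to 1961
# (assuming the natural logarithm)
ratio <- exp(y)

print(y)
print(ratio)
```

Writing the intercept as a 1 in the first position of x is what allows the whole formula to be expressed as a single product of two vectors.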
5. Step for prediction of wine price
• wine price:Y=θX
• Y : value to be predicted (the future wine price in this case; unknown)
• X : known values (for example, past temperatures are already known)
• The parameters θ are unknown
→ If θ is obtained, the future wine price Y can be obtained, too!
• Y in the past is also known (past wine prices are known)
→ X and Y in the past are available as a set → θ can be obtained
7. How to obtain θ: least squares method
• Predictions are compared with observed values (values in the past), and the parameters θ are chosen so that the sum of squared differences is minimized
• There are programs (algorithms) that execute these calculations automatically on a computer
• In practice, we rarely calculate the parameters manually (for realistic data sets it is not feasible to do by hand)
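As a rough illustration of letting the computer minimize the squared differences, here is a small R sketch. The data values are hypothetical, and the general-purpose optimizer optim is used only to make the "search for the minimum" visible; it is not necessarily how the slides' author computes θ.

```r
# Toy data (hypothetical): one input variable x and observed values y
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared differences between predictions and observations
# theta[1] is the intercept, theta[2] is the slope
squared_error <- function(theta) {
  prediction <- theta[1] + theta[2] * x
  sum((prediction - y)^2)
}

# Let the computer search for the theta that minimizes the squared error
result <- optim(c(0, 0), squared_error, method = "BFGS")
theta_optim <- result$par

# lm() solves the same least squares problem directly
theta_lm <- unname(coef(lm(y ~ x)))

print(theta_optim)
print(theta_lm)
```

Both approaches should agree up to small numerical differences, which suggests that lm is doing the same minimization in a more direct way.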
8. Parameter calculations by computer
[Diagram: values in the past (X, Y) → parameter calculation → θ → price prediction model Y=θX]
9. θ and X are not “just a number”
• Each is a collection of numbers
• They can be represented as vectors and matrices in math
• Massive amounts of data can be represented by vectors and matrices with ease!
• Data can be handled as vectors and matrices in computers
• Major programming languages, such as R, MATLAB and Python, can create vectors and matrices and manipulate them efficiently
10. Be familiar with vectors and matrices!
• You can handle data as you like
• You can program it by yourself
• First step for practical data analysis
11. 2. Data handling(vector&matrix)
Math in high school is important !
• Mainly arithmetic is explained
• Nothing more than +, -, ×, /
• Practice by hand until you are familiar with vectors and matrices
• Let us verify the results by using R
12. Vector : one line
• either vertical or horizontal
[1 3 7]   [5 13]   [1 5]
Verify the results in R (shown in blue on the slides):
a=c(1,3,7)
b=c(5,13)
d=c(1,5)
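A short sketch of how these vectors behave in R. The names row_vec and col_vec are introduced here only for illustration; the use of matrix() to make the orientation explicit is one possible approach, not something the slides prescribe.

```r
# Vectors from the slide
a <- c(1, 3, 7)
b <- c(5, 13)
d <- c(1, 5)

# Number of elements in each vector
length(a)
length(b)

# c() has no orientation; matrix() makes it explicit
row_vec <- matrix(a, nrow = 1)   # horizontal: 1 row, 3 columns
col_vec <- matrix(a, ncol = 1)   # vertical: 3 rows, 1 column
dim(row_vec)
dim(col_vec)

# Vectors of the same length are added element by element
b + d
```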
17. Matrix : rectangular shape
• dimension:number of rows × number of columns (m×n)
• examples: 2×2, 2×2, 3×2
a=matrix(c(1,3,2,4),2,2)
18. Matrix : elements
• elements (entries)
first row and first column:1
second row and first column:3
first row and second column:2
second row and second column:4
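The dimension and the individual elements can be verified in R with a small sketch like the following. The 3×2 matrix m is a hypothetical example with arbitrary values.

```r
# matrix() fills the matrix column by column
a <- matrix(c(1, 3, 2, 4), 2, 2)

# Number of rows, number of columns
dim(a)

# a[i, j] is the element in row i and column j
a[1, 1]    # first row, first column
a[2, 1]    # second row, first column
a[1, 2]    # first row, second column
a[2, 2]    # second row, second column

# A 3x2 matrix: 3 rows and 2 columns (values are arbitrary)
m <- matrix(1:6, nrow = 3, ncol = 2)
dim(m)
```

Because R fills column by column, c(1,3,2,4) puts 1 and 3 in the first column and 2 and 4 in the second.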
26. identity matrix
• Diagonal elements are 1
• All other elements are 0
• Multiplication by the identity matrix changes nothing: $A I = I A = A$
diag(2)
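A quick check in R that multiplying by the identity matrix leaves a matrix unchanged. The matrix a is the one used elsewhere in this deck; the names i2, left and right are chosen here for illustration.

```r
a <- matrix(c(1, 3, 2, 4), 2, 2)
i2 <- diag(2)          # 2x2 identity matrix

# %*% is matrix multiplication (* would multiply element by element)
left  <- i2 %*% a
right <- a %*% i2

# Both products equal the original matrix
all.equal(left, a)
all.equal(right, a)
```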
27. inverse matrix
• If A is an m×m matrix and A has an inverse matrix $A^{-1}$, then $A A^{-1} = A^{-1} A = I$ (I : identity matrix)
a=matrix(c(1,3,2,4),2,2)
> a
     [,1] [,2]
[1,]    1    2
[2,]    3    4
> inv=solve(a)
> inv
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5
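The defining property of the inverse can be verified with a sketch like this one. The singular matrix s is a hypothetical example added to show that not every square matrix has an inverse; the try() wrapper is just one way to catch the resulting error.

```r
a <- matrix(c(1, 3, 2, 4), 2, 2)
inv <- solve(a)        # inverse of a, as on the slide

# Multiplying in either order gives the identity matrix
# (up to floating-point rounding, hence all.equal)
all.equal(a %*% inv, diag(2))
all.equal(inv %*% a, diag(2))

# A matrix whose determinant is 0 has no inverse
s <- matrix(c(1, 2, 2, 4), 2, 2)
det(s)
has_inverse <- !inherits(try(solve(s), silent = TRUE), "try-error")
has_inverse
```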
28. transpose matrix
• swap the row and column positions of each element
a=matrix(c(1,3,2,4),2,2)
t(a)
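A small R sketch of what transposition does. The 3×2 matrix m is a hypothetical example with arbitrary values, included to show that the dimensions are swapped as well.

```r
a <- matrix(c(1, 3, 2, 4), 2, 2)
at <- t(a)             # rows become columns

# Element (1,2) of a is element (2,1) of t(a)
a[1, 2] == at[2, 1]

# Transposing twice returns the original matrix
identical(t(at), a)

# A 3x2 matrix becomes 2x3 (values are arbitrary)
m <- matrix(1:6, nrow = 3, ncol = 2)
dim(t(m))
```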
29. Least squares estimation
• Vectors and matrices are used when programming least squares estimation
• J = 1/(2*m) * T(X*θ-Y)*(X*θ-Y):cost function (squared error function)
• m : number of sample data
• X is a matrix, Y is a vector, θ is a parameter vector
• T( ) means the transpose
• θ is obtained so that J is minimized (the difference between predictions and real values is minimized)
→ Least squares estimation
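The cost function J can be written almost literally in R. The sketch below uses hypothetical toy data and finds θ with the normal equation, solving (T(X) X) θ = T(X) Y. The normal equation is a standard closed-form result for least squares that is not stated on the slides, so treat it as an assumption about one possible way to minimize J.

```r
# Toy data (hypothetical): first column of X is 1 for the intercept
X <- cbind(1, c(1, 2, 3, 4, 5))
Y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
m <- nrow(X)           # number of sample data

# Cost function J = 1/(2*m) * T(X*theta - Y) * (X*theta - Y)
cost <- function(theta) {
  e <- X %*% theta - Y
  as.numeric(t(e) %*% e) / (2 * m)
}

# Normal equation: solve (T(X) X) theta = T(X) Y for theta
theta <- as.numeric(solve(t(X) %*% X, t(X) %*% Y))

print(theta)
print(cost(theta))

# J is larger for any other theta, for example a slightly shifted one
cost(theta + c(0.1, 0)) > cost(theta)
```

Only transpose, matrix multiplication and solve are needed here, which is why the preceding slides on vectors and matrices matter for least squares.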
30. analysis by linear regression model “lm”
> wineprice=lm(LPRICE2~WRAIN+DEGREES+HRAIN+TIME_SV, data=wine)
> wineprice
After lm, put the variable to be predicted, then “~” and the input variables, then data = name of the data set
> ▼▼▼=lm(◯◯◯~△△△+■■■, data=◎◎◎)
> ▼▼▼
◯◯◯ : a variable to be predicted
△△△, ■■■ : input variables
3. Linear regression with R
Data is available here: http://www.liquidasset.com/winedata.html
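To try the lm pattern without downloading anything, here is a minimal sketch on synthetic data. The data frame df and its columns x1, x2 and y are hypothetical; y is constructed exactly from known parameters so that the result can be checked by eye.

```r
# Synthetic data (hypothetical): y is built exactly as 1 + 2*x1 - 0.5*x2
df <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, 1, 4, 3, 6, 8)
)
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2

# variable to be predicted ~ input variables, data = name of the data frame
fit <- lm(y ~ x1 + x2, data = df)

# Printing the fitted model shows the call and the coefficients
print(fit)

# The estimated parameters as a named vector
print(coef(fit))
```

Since the synthetic data contain no noise, the estimated coefficients should match the parameters used to build y up to rounding.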
31. Parameters can be obtained!
Call:
lm(formula = LPRICE2 ~ WRAIN + DEGREES + HRAIN + TIME_SV, data = wine)
Coefficients:
(Intercept)        WRAIN      DEGREES        HRAIN      TIME_SV
 -12.145007     0.001167     0.616365    -0.003861     0.023850
Let us compare them with the formula for prediction of wine price
33. Prediction and real price
[Plot: □ prediction, ◯ real price]
predict(wineprice,data.frame(wine))
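The following sketch shows how predict can be used both on the data that was used for fitting and on a new input. The data values and the names df, fit, comparison and new_value are hypothetical and are not the wine data.

```r
# Synthetic data (hypothetical), for illustration only
df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)
fit <- lm(y ~ x, data = df)

# Predictions for the rows used to fit the model
fitted_values <- predict(fit, df)

# Compare prediction and real value side by side
comparison <- data.frame(real = df$y, prediction = fitted_values)
print(comparison)

# Prediction for a new, unseen input
new_value <- predict(fit, newdata = data.frame(x = 6))
print(new_value)
```

The column name in newdata has to match the input variable name used in the lm formula, otherwise predict cannot find the variable.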
34. analyze data by functions automatically
• With the function ‘lm’, parameters can be obtained with a one-line command
• There are a lot of functions in R. We can analyze data with these functions without writing functions ourselves.
• However, we should broadly understand how the calculations are done inside the functions. A black-box approach is not recommended.
• The better we understand the functions, the better we can select the right function for a particular problem
• Let us become familiar with ‘lm’. Then you can understand other functions with ease
35. recommender systems
• Amazon.com and Netflix are famous for recommendations
• There are a variety of recommendation approaches
• Recommend the most popular product
→ the same recommendation for everyone
• Recommend the best products for the individual customer
→ a personalization method is needed!
36. Personalization
• An example of a method for personalized recommendations
• θ:the customer’s preferences (did they click the product or not? did they provide a rating or not?)
• X:item features (in the case of movies: horror? romance? sci-fi? Who are the director, actors and actresses? When and where was it made?)
• Obtain probabilities based on θX with a logistic regression model
• If the probability is high, the item is recommended to the customer
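As a rough sketch of the idea, the probability can be computed by passing θX through the logistic function. All numbers below are hypothetical: the preference vector theta, the feature matrix X, the movie names and the 0.5 threshold are assumptions made for illustration, and in practice θ would be estimated from data (for example with glm and family = binomial).

```r
# Logistic (sigmoid) function: maps any number to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# theta: one customer's preferences (hypothetical values)
# order: intercept, horror, romance
theta <- c(-1, 2, 0.5)

# X: item features, one row per movie (hypothetical values)
# first column is 1 for the intercept, then 1 = has the feature
X <- rbind(
  movie_a = c(1, 1, 0),   # horror
  movie_b = c(1, 0, 1),   # romance
  movie_c = c(1, 1, 1)    # both
)

# Probability that the customer likes each item
p <- sigmoid(as.numeric(X %*% theta))
names(p) <- rownames(X)
print(p)

# Recommend items whose probability exceeds a threshold (0.5 is an assumption)
recommended <- names(p)[p > 0.5]
print(recommended)
```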
37. Quandl:data source
• Over 10M datasets are available for free
• Data can be downloaded directly to R, MATLAB, and Python
https://www.quandl.com
38. Website of R and RStudio
• R is a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org
• I have prepared a short video about how to use R.
http://www.toshistats.net/introduction-to-r-language/
• RStudio is one of the best IDEs for R.
http://www.rstudio.com/products/rstudio/download/
39. Thank you for your attention
• TOSHI STATS SDN. BHD, Digital-learning center for statistical computing in Asia
• CEO : Toshifumi Kuga, Certified financial services auditor
• Company website : www.toshistats.net
• Company FB page : www.facebook.com/toshistatsco
• Company blog : http://toshistats.wordpress.com/aboutme/
• The company blog is updated at 10:00 AM every Thursday and reports the latest information about data analysis! Please look at the blog or the company website.