During this presentation, we will introduce basic data science concepts and discuss a project carried out for one of our clients.
We will see how data science projects can easily be carried out with the statistical programming language R, and how R integrates with the new Microsoft SQL Server 2016 suite.
4. Why Data Science?
[Figure: stages plotted by analytic maturity and value: Raw Data, Operational Reporting, Descriptive Analytics, Predictive Analytics, Prescriptive Analytics]
5. What is Data Science?
[Figure: diagram with three components: Domain Expertise, Math & Stats, Computer Science]
6. CRISP-DM (CRoss Industry Standard Process - Data Mining)
Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining.
Journal of Data Warehousing, 5(4), 13–22.
[Figure: CRISP-DM phases arranged around DATA: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]
9. What are ECTS (European Credit Transfer and Accumulation System)?
Follow courses → Take exams → Pass exams → Obtain ECTS credits
10. How do studies at the university work?
Bachelor in Mathematics: Mathematics (120 ECTS), Computer Science (60 ECTS)
Also shown: Philosophy (60 ECTS), Biology (60 ECTS)
Master in Mathematics: Mathematics (90 ECTS)
11. How do studies at the university work?
3 years: 180 ECTS
1.5 years: 90 ECTS
30 ECTS are equivalent to a full-time study load for one semester.
12. What is the study intensity?
The study intensity of a student for a given semester is the number of ECTS
credits on which the student is evaluated in that semester.
-------------------------------------------------------------------------------------------------------
Example. Dominique Duay follows three courses:
1. Introduction to Machine Learning (4 ECTS)
2. Macroeconomics (6 ECTS)
3. Data Analysis and Statistics with R (8 ECTS)
Dominique takes the exams for the first two only, so his study intensity for the first
semester is 10 ECTS.
The next semester he decides to take the exam for the third course, which adds 8 ECTS
to his study intensity for that semester.
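The definition can be checked against Dominique's example with a few lines of code. This is a minimal sketch in Python (the talk itself uses R); the record layout and the `study_intensity` helper are hypothetical names chosen for illustration, and the second semester lists only the postponed exam because the example names no other courses.

```python
from collections import defaultdict

# Each record: (semester, course, ECTS credits, exam taken in that semester?)
# Course names and credits come from the example; the layout is an assumption.
enrolments = [
    (1, "Introduction to Machine Learning", 4, True),
    (1, "Macroeconomics", 6, True),
    (1, "Data Analysis and Statistics with R", 8, False),  # exam postponed
    (2, "Data Analysis and Statistics with R", 8, True),   # exam taken later
]


def study_intensity(records):
    """Sum the ECTS credits evaluated (exam taken) in each semester."""
    totals = defaultdict(int)
    for semester, _course, ects, exam_taken in records:
        if exam_taken:
            totals[semester] += ects
    return dict(totals)


intensity = study_intensity(enrolments)
print(intensity)  # {1: 10, 2: 8}
```

Only courses whose exam is actually taken count, which is why following a course without sitting the exam leaves the study intensity unchanged.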
13. What is the big deal about the study intensity?
The average study intensity varies significantly across study paths, programs,
faculties and levels, and it is not clear why.
Study intensity is perceived to be linked to the reputation of the
studies.
The number of ECTS evaluated per year is correlated to some extent with the
budget the university receives from the Confederation.
A few years ago the Swiss Confederation started to monitor the study intensity
of universities more closely.
14. Strategy
Identify variables correlated to low/high study intensity
Predict which students will have a low study intensity
Increase study intensity with concrete actions
16. What is R?
Programming language for statistical computing and graphics
Interpreted language (access it through the console)
Open source and used by researchers, statisticians and data miners all around the world
More than 9,000 packages on the CRAN repository
Runs in memory (mostly…)
18. The Data
DWH fact table and related dimensions
1 line = number of ECTS evaluated per student, per course, per professor…
We aggregate this data to have one line per student, per semester
We take data from 2012-2015
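The aggregation step can be sketched as follows. This is an illustrative Python version (the talk works in R); the column layout, the sample rows and the `aggregate_per_student_semester` helper are all hypothetical, since the slides do not show the actual schema of the fact table.

```python
from collections import defaultdict

# Hypothetical fact rows, one per student, course and professor:
# (student_id, semester, course, professor, ects_evaluated)
fact_rows = [
    ("S001", "2012-A", "Course 1", "Prof. X", 6),
    ("S001", "2012-A", "Course 2", "Prof. Y", 4),
    ("S001", "2013-S", "Course 3", "Prof. X", 8),
    ("S002", "2012-A", "Course 1", "Prof. X", 6),
]


def aggregate_per_student_semester(rows):
    """Collapse the fact rows to one line per (student, semester)."""
    totals = defaultdict(int)
    for student_id, semester, _course, _professor, ects in rows:
        totals[(student_id, semester)] += ects
    return [
        {"student_id": student, "semester": semester, "ects_evaluated": ects}
        for (student, semester), ects in sorted(totals.items())
    ]


per_student = aggregate_per_student_semester(fact_rows)
for line in per_student:
    print(line)
```

Each resulting line holds the study intensity of one student for one semester, which is the quantity modelled later in the talk.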
20. OFS Report on study intensity
Significant independent variables we already know:
Age
Level (Bachelor, Master)
Major (Economics, Law, Medicine, …)
23. Comparison between models
Root Mean Square Error: $0 < \mathrm{RMSE} < \sigma$ (the smaller, the better…)
R Squared: $R^2 < 1$ (the closer to 1, the better…)
Linear Regression: RMSE = 12.2, $R^2$ = 0.37
Regression Tree: RMSE = 11.6, $R^2$ = 0.43
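The two metrics can be computed as in the sketch below, written in Python for illustration (the talk uses R). It assumes that σ denotes the standard deviation of the observed values, in which case a model that always predicts the mean has RMSE equal to σ and R squared equal to 0. The numbers in `y_true` and `y_pred` are made up and are not the project's data.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up ECTS values, for illustration only.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

# Baseline: always predict the mean of the observed values.
baseline = [sum(y_true) / len(y_true)] * len(y_true)

print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
print(rmse(y_true, baseline), statistics.pstdev(y_true))
```

Under this reading, a model is useful only when its RMSE falls below σ, which is the same as having an R squared above 0.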