8323 Stats - Lesson 1 - 02 Introduction General 2008
1. STATISTICS FOR ECONOMICS AND BUSINESS The course I loved to hate… (S.B.)
2.
3.
4.
5. STATISTICS FOR ECONOMICS AND BUSINESS Assessment Methods
   For attending students the course grade is based on:
   - The analysis of a real data set (PC-lab session, 4 hours). Here the focus is on the proper use of statistical techniques and on the adequacy of the economic conclusions drawn from the obtained results. Documents with SAS procedures can be used during the exam (no other material is allowed).
   - A written exam on the methodological issues discussed during the course (content of the theoretical slides).
   The two exams are graded separately (max grades = 21 and 6 respectively).
   - 2 assignments (group work). At least 2 lessons are dedicated to discussing the 2 assignments, and all group members must be present at the discussion. In these lessons one person picked at random from each group will illustrate (part of) the obtained results (material may be consulted). If that person answers reasonably, the group's assignment is graded (0-2 for each assignment). Otherwise the grade is 0 for all group members.
   Non-attending students (those who did not hand in both assignments): extended practical and theoretical exams (max grades = 23 and 8 respectively).
6.
7. Multivariate Data Analysis: techniques to analyze/synthesize data sets with many variables and/or many observations. MOTIVATION
8. Multivariate Data Analysis – Motivation Example 1. Innovation and Research in Europe (Source: Eurostat). Variables (name: description):
   Geo: Country code
   Country: Country name
   Region: European region
   E_gov_avail: E-government on-line availability (online availability of 20 basic public services)
   HT_Exports: Exports of high technology products as a share of total exports
   Y_Educ__Lev_m: % of males aged 20-24 having completed at least upper secondary education
   Y_Educ_Lev_f: % of females aged 20-24 having completed at least upper secondary education
   Y_Educ_Lev: Youth education attainment level, total (% of the population aged 20-24 who completed at least upper secondary education)
   Telec_Expenditure: Expenditure on Telecommunications as a % of GDP
   IT_Expenditure: Expenditure on Information Technology as a % of GDP
   USTPO: Number of patents granted by the US Patent and Trademark Office per million inhabitants
   EPO: Number of patent applications to the European Patent Office per million inhabitants
   ST_grad_m: Male tertiary graduates in S&T per 1000 males aged 20-29
   ST_grad_f: Female tertiary graduates in S&T per 1000 females aged 20-29
   ST_grad: Tertiary graduates in science and technology (S&T) per 1000 persons aged 20-29
   Internet_Acc: Level of Internet access (% of households who have Internet access at home)
   GERD_abroad: % of GERD financed from abroad
   GERD_govern: % of GERD financed by government
   GERD_industry: % of GERD financed by industry
   GERD: Gross domestic expenditure on R&D (GERD) as a % of GDP
   Educ_Exp: Spending on Human Resources (total public expenditure on education) as a % of GDP
9. Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few observations and a few variables, transformed so that all variables have the same unit of measurement (we will show later how we obtain this result). How can we study the relationships among all the variables to understand the main tendencies in the data, i.e. whether there are groups of variables acting in the same or in the opposite direction?
   country        | region   | HT_Exports | E_gov_avail |   EPO | ST_grad | Internet_Acc | GERD_abroad | GERD_govern | GERD_industry |  GERD | Educ_Exp
   France         | Western  |       0.84 |        0.21 |  0.15 |    1.41 |        -0.33 |       -0.14 |        0.07 |          0.04 |  0.47 |     0.51
   Germany        | Western  |       0.11 |        0.00 |  1.53 |   -0.69 |         0.92 |       -1.01 |       -0.47 |          1.02 |  0.72 |    -0.47
   Belgium        | Western  |      -0.62 |       -0.84 |  0.13 |   -0.25 |         0.44 |        0.77 |       -1.38 |          0.83 |  0.36 |     0.63
   Netherlands    | Western  |       0.53 |       -1.04 |  1.05 |   -0.97 |         1.25 |        0.56 |       -0.04 |         -0.16 |  0.09 |    -0.18
   Spain          | Southern |      -0.83 |        0.56 | -0.86 |    0.01 |        -0.33 |       -0.05 |        0.36 |         -0.56 | -0.75 |    -0.81
   Italy          | Southern |      -0.62 |        0.42 | -0.39 |   -0.82 |        -0.33 |       -0.60 |        1.03 |         -0.56 | -0.58 |    -0.45
   Greece         | Southern |      -0.72 |       -1.04 | -1.05 |   -0.71 |        -1.14 |        1.94 |        1.01 |         -1.77 | -1.01 |    -1.10
   Sweden         | Northern |       0.01 |        1.88 |  1.47 |    0.27 |         1.54 |       -0.85 |       -1.45 |          1.52 |  2.42 |     1.79
   Finland        | Northern |       0.74 |        1.39 |  1.61 |    1.03 |         0.49 |       -1.01 |       -1.04 |          1.46 |  1.52 |     1.00
   United Kingdom | Northern |       1.57 |        0.84 | -0.03 |    1.56 |         0.72 |        2.20 |       -0.72 |         -0.70 |  0.12 |     0.08
   Norway         | Northern |      -0.93 |        0.63 |  0.07 |   -0.76 |         0.92 |       -0.16 |        0.35 |         -0.18 | -0.10 |     1.91
   Ireland        | Northern |       2.20 |        0.21 | -0.43 |    1.60 |        -0.04 |       -0.36 |       -1.03 |          1.11 | -0.57 |    -0.72
   Lithuania      | Northern |      -1.25 |       -0.49 | -1.11 |    0.51 |        -1.38 |       -0.25 |        1.95 |         -1.42 | -0.98 |    -0.09
   Czech Republic | Eastern  |      -0.20 |       -1.18 | -1.03 |   -1.08 |        -1.05 |       -1.07 |        0.72 |         -0.11 | -0.48 |    -0.60
   Romania        | Eastern  |      -0.83 |       -1.53 | -1.12 |   -1.11 |        -1.67 |        0.04 |        0.66 |         -0.52 | -1.24 |    -1.51
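The slide only announces that the transformation will be explained later, so the Python sketch below is an assumption rather than the course's own method. It applies the common z-score standardization (subtract the mean, divide by the sample standard deviation). The function name `standardize` and the raw values are hypothetical. The GERD values are copied from the subset above; a mean close to 0 and a standard deviation close to 1 would be consistent with this kind of transformation.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]

# Hypothetical raw values (e.g. an expenditure as a % of GDP); not from the slides.
raw = [1.9, 2.5, 0.6, 3.4, 1.1]
z = standardize(raw)

# GERD column of the subset shown above (France ... Romania).
gerd = [0.47, 0.72, 0.36, 0.09, -0.75, -0.58, -1.01, 2.42,
        1.52, 0.12, -0.10, -0.57, -0.98, -0.48, -1.24]

print("standardized raw values:", [round(v, 2) for v in z])
print("GERD mean and sd:", round(mean(gerd), 2), round(stdev(gerd), 2))
```

After standardization every variable is expressed in standard-deviation units, which is what makes columns such as GERD and Internet_Acc directly comparable.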
10. Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe (subset). How can we study the relationships among all the variables? 2) Obtain a line plot for VARIABLES: a line is associated with each variable. We can observe groups of variables with similar tendencies across some of the observations, for example the orange-red lines, the green ones or the blue ones. These three groups of variables show different tendencies.
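One simple numerical counterpart to reading the line plot is the correlation between pairs of variables. The sketch below is only an illustration under that assumption, not necessarily the technique the course develops. It computes Pearson correlations for three columns copied from the subset table; the helper name `pearson` is hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Columns of the subset table, in row order France ... Romania.
gerd = [0.47, 0.72, 0.36, 0.09, -0.75, -0.58, -1.01, 2.42,
        1.52, 0.12, -0.10, -0.57, -0.98, -0.48, -1.24]
gerd_industry = [0.04, 1.02, 0.83, -0.16, -0.56, -0.56, -1.77, 1.52,
                 1.46, -0.70, -0.18, 1.11, -1.42, -0.11, -0.52]
gerd_govern = [0.07, -0.47, -1.38, -0.04, 0.36, 1.03, 1.01, -1.45,
               -1.04, -0.72, 0.35, -1.03, 1.95, 0.72, 0.66]

r_same = pearson(gerd, gerd_industry)            # variables moving together
r_opposite = pearson(gerd_industry, gerd_govern) # variables moving in opposite directions

print("GERD vs GERD_industry:", round(r_same, 2))
print("GERD_industry vs GERD_govern:", round(r_opposite, 2))
```

A positive coefficient indicates two variables that tend to move in the same direction across countries, a negative one the opposite. With many variables the number of pairs grows quickly, which is the difficulty the following slides point to.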
11. Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe (subset). How can we combine the information provided by all the variables to compare the innovation/research performance of each country? Should we consider the means of the previously observed groups OF VARIABLES? Are they sufficient to explain ALL the variables? Should we consider the 3 means, one for each group, and compare observations on the basis of them? Which is the most important index/mean? Should the 3 indices have the same weight when comparing observations? What if we want a single index? Is it possible, and how much information do we lose?
   Group 1: GERD, GERD_industry, Internet_Acc, EPO, Educ_Exp, E_gov_avail
   Group 2: ST_grad, HT_Exports
   Group 3: GERD_govern
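To make the "one mean per group" idea concrete, the sketch below computes three group means for a few countries, using the values of the subset table and the three groups listed on the slide. Equal weights within each group are an assumption made only for illustration (the slide itself questions whether they are appropriate), and the names `GROUPS`, `rows` and `group_means` are hypothetical.

```python
# Variable groups as listed on the slide.
GROUPS = {
    "group1": ["GERD", "GERD_industry", "Internet_Acc", "EPO", "Educ_Exp", "E_gov_avail"],
    "group2": ["ST_grad", "HT_Exports"],
    "group3": ["GERD_govern"],
}

# Three rows copied from the subset table (standardized values).
rows = {
    "Sweden": {"HT_Exports": 0.01, "E_gov_avail": 1.88, "EPO": 1.47, "ST_grad": 0.27,
               "Internet_Acc": 1.54, "GERD_abroad": -0.85, "GERD_govern": -1.45,
               "GERD_industry": 1.52, "GERD": 2.42, "Educ_Exp": 1.79},
    "Ireland": {"HT_Exports": 2.20, "E_gov_avail": 0.21, "EPO": -0.43, "ST_grad": 1.60,
                "Internet_Acc": -0.04, "GERD_abroad": -0.36, "GERD_govern": -1.03,
                "GERD_industry": 1.11, "GERD": -0.57, "Educ_Exp": -0.72},
    "Romania": {"HT_Exports": -0.83, "E_gov_avail": -1.53, "EPO": -1.12, "ST_grad": -1.11,
                "Internet_Acc": -1.67, "GERD_abroad": 0.04, "GERD_govern": 0.66,
                "GERD_industry": -0.52, "GERD": -1.24, "Educ_Exp": -1.51},
}

def group_means(row):
    """Unweighted mean of the variables in each group (equal weights are an assumption)."""
    return {g: sum(row[v] for v in names) / len(names) for g, names in GROUPS.items()}

indices = {country: group_means(row) for country, row in rows.items()}
for country, idx in indices.items():
    print(country, {g: round(v, 2) for g, v in idx.items()})
```

Note that GERD_abroad belongs to none of the three groups, so these indices ignore it entirely. This is one way in which such hand-made summaries lose information.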
12. Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe. How can we study the relationships among all the variables? Things become complicated when we consider more variables/observations. FINDING GROUPS OF VARIABLES WITH SIMILAR PATTERNS IS DIFFICULT
13.
14. Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe (subset). How can we describe the main tendencies of European countries with respect to innovation? Are there countries with similar characteristics? Which are the main patterns/profiles in this data set? Obtain a line plot FOR OBSERVATIONS: a line is associated with each observation. We can observe groups of observations with similar tendencies (for example the orange-red ones). Tendencies are similar only with respect to some variables. Which variables should be given most consideration? Who is "close" to whom? How can we describe similarity or dissimilarity between countries in a simple way?
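One possible way to quantify who is "close" to whom is a distance between country profiles. The sketch below uses the Euclidean distance over the ten standardized variables. This choice of measure is an assumption for illustration, not something the slides prescribe, and the name `profiles` is hypothetical. The rows are copied from the subset table.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Standardized profiles in the column order of the subset table:
# HT_Exports, E_gov_avail, EPO, ST_grad, Internet_Acc,
# GERD_abroad, GERD_govern, GERD_industry, GERD, Educ_Exp
profiles = {
    "Sweden":  [0.01, 1.88, 1.47, 0.27, 1.54, -0.85, -1.45, 1.52, 2.42, 1.79],
    "Finland": [0.74, 1.39, 1.61, 1.03, 0.49, -1.01, -1.04, 1.46, 1.52, 1.00],
    "Romania": [-0.83, -1.53, -1.12, -1.11, -1.67, 0.04, 0.66, -0.52, -1.24, -1.51],
}

d_sweden_finland = dist(profiles["Sweden"], profiles["Finland"])
d_sweden_romania = dist(profiles["Sweden"], profiles["Romania"])

print("Sweden-Finland:", round(d_sweden_finland, 2))
print("Sweden-Romania:", round(d_sweden_romania, 2))
```

A smaller distance means more similar profiles. Every variable enters with the same weight here, which leaves open the slide's question of which variables should count most.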
15. Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe (subset). How can we identify groups of cases (countries) with similar characteristics? Sometimes the grouping is obtained on the basis of a priori knowledge. In this case, for example, we could group by region. Grouping observations according to the region is not a good idea here: countries in the same region show different patterns.
16. Multivariate Data Analysis – Motivation Example 1 (continued). Innovation and Research in Europe. How can we describe the main tendencies of European countries with respect to innovation? Things become complicated when we consider more variables/observations. FINDING GROUPS OF OBSERVATIONS WITH SIMILAR PATTERNS IS DIFFICULT
17.
18. Multivariate Data Analysis – Motivation Example 2. Information about projects financed by the EU in 1995-1996. Variables (name: description):
   Information about the Project
   Record: Project id
   Duration: Duration of the project
   Size: Number of organisations involved in the project
   Topic: Topic of the project
   Information about the Responsible (organization which is coordinating the project)
   EMP: Employees of the Responsible
   REV: Revenues of the Responsible
   Type: Type of organisation (Industry, Education, Research, Commercial) of the Responsible
   Country: Nationality of the Responsible
   Proj_resp_1995: Number of projects coordinated by the Responsible before 1995
   Proj_resp_end_1995: Number of projects coordinated by the Responsible that ended before 1995
   Activity_partner: Evaluation of the activity of the Responsible as a partner in other projects before 1995 (8 point scale; 1=very poor, 8=excellent)
19. Multivariate Data Analysis Example 2 (continued). Projects financed by the EU in 1995-1996 (partial input). Is there an association between the country, the type of organization and the topic? Are there organizations/countries specialized in particular topics? If there is an association, what is it due to? Who is attracted by what?
   RECORD | COUNTRY     | TYPE           | TOPIC                | SIZE | DURATION | PROJ_RESP_END_1995 | PROJ_RESP_1995 | ACTIVITY_PARTNER |   EMP |      REV
   27410  | UK          | Industry       | STANDARDS            |    5 |       30 |                  0 |              3 |                1 |  2633 |   248824
   24175  | Netherlands | Industry       | TELECOMMUNICATIONS   |    4 |       18 |                  1 |              1 |                1 |  1394 |   208312
   24174  | UK          | Industry       | TELECOMMUNICATIONS   |    3 |       18 |                  4 |              6 |                2 |   363 |    18947
   24171  | Italy       | Industry       | TELECOMMUNICATIONS   |    6 |       24 |                  0 |              2 |                2 |   259 |     5706
   23988  | Italy       | Industry       | SAFETY               |    7 |       33 |                  0 |              1 |                1 |   199 |    23859
   23985  | UK          | Education      | SAFETY               |    4 |       24 |                  0 |              7 |                7 | 53164 | 15297220
   23806  | France      | Non Commercial | SAFETY               |   10 |       24 |                  0 |             10 |                7 |   594 |   168066
   23770  | France      | Research       | NATURAL_RESOURCES    |    6 |       36 |                  0 |              2 |                2 |    12 |      974
   23682  | Germany     | Education      | ENERGY               |    4 |       24 |                  1 |              3 |                6 | 10343 |  4547875
   23611  | Germany     | Education      | ENERGY               |    3 |       24 |                  0 |              1 |                2 | 78701 | 15930801
   23601  | Netherlands | Research       | NATURAL_RESOURCES    |    7 |       36 |                  0 |              1 |                2 |   163 |    18400
   23596  | UK          | Research       | NATURAL_RESOURCES    |    5 |       24 |                  0 |              3 |                1 |   572 |    99404
   23590  | Italy       | Industry       | ENERGY               |    5 |       18 |                  1 |              6 |                6 | 34217 |  9969376
   23386  | Belgium     | Education      | MATERIALS TECHNOLOGY |   15 |       24 |                  2 |             10 |                6 |   310 |    39707
   23376  | France      | Research       | MATERIALS TECHNOLOGY |    6 |       24 |                  0 |              1 |                2 |    49 |     6353
20.
21.
22. Multivariate Data Analysis The aim of Multivariate Statistical Techniques is to extract the information contained in a given data set by simplifying and summarizing observations and/or variables with DATA DRIVEN TOOLS. The tool used to make information available (i.e., the compression/simplification/synthesis of data) depends upon the aim of the analysis and on the nature of the variables taken into account.