2. Presentation Outline
Introduction
Data Structure
Cross-Sectional Data
Regression Diagnostics
Other Regression Commands
Presenting Your Results
Suggested readings
3. Introduction
What is Stata?
Stata is a general-purpose statistical software package created in 1985
by StataCorp. Most of its users work in research, especially in the
fields of economics, sociology, political science, biomedicine and
epidemiology.
Why Stata?
Stata has been less popular than its market competitors, such as
SPSS and SAS, but it is gaining in popularity every year.
It is particularly user-friendly when it comes to analyzing
complicated data sets.
4. Introduction (Cont’d)
How is Stata different?
The commands in Stata are much more intuitive and less fussy
regarding punctuation.
In Stata, it is possible to download new applications that were
written by users to perform specific tasks, and use them as
commands.
Dealing with longitudinal data sets that have different types of file
structures is much quicker and easier in Stata.
5. Introduction (Cont’d)
Windows in Stata and what they do
The command window
The review window
The variables window
The results window
Do files
Log files
Data editor and data browser
set mem 50m (only needed in Stata 11 and earlier)
log using name.log, replace
log close
log off
log on
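The logging commands above could be combined in a do file along the following lines. This is only a minimal sketch: the log file name session.log is hypothetical, and the auto data set shipped with Stata is used purely as example data.

```stata
* Minimal do-file skeleton (session.log is a hypothetical file name)
capture log close                  // close any log left open by an earlier run
log using session.log, replace text
sysuse auto, clear                 // example data set shipped with Stata
describe                           // written to the log
log off                            // pause logging
summarize price                    // shown on screen, not written to the log
log on                             // resume logging
log close
```

Starting with capture log close keeps the do file from stopping with an error when a log is still open from a previous run.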
7. Cross-sectional data
Summary statistics
sum var
sum var, sep(0)
sum var, detail
tabstat var, s(n me sd min max ske kur) c(s)
tabstat var, s(n me sd min max) by(groupvar)
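As an illustration of the summary commands above, the following sketch uses the auto data set shipped with Stata. The choice of price, mpg, weight and foreign is an assumption made only for the example; option names are written out in full.

```stata
sysuse auto, clear
summarize price mpg weight              // n, mean, sd, min, max
summarize price mpg weight, sep(0)      // same table without separator lines
summarize price, detail                 // adds percentiles, skewness, kurtosis
* One column per statistic, one row per variable
tabstat price mpg weight, statistics(n mean sd min max skewness kurtosis) columns(statistics)
* The same statistics split by a grouping variable
tabstat price mpg, statistics(n mean sd min max) by(foreign)
```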
8. Cross-sectional data
Correlations
Pearson’s product-moment correlation (r)
It focuses on mean values
It is used for interval variables
Absolute values below 0.30 suggest there is little association between the variables
(Hinkle et al. 1988).
pwcorr var, obs sig star(0.05)
Spearman’s correlation (rho)
It is calculated from ranks
It is used for ordinal variables
spearman var, stats(rho obs p) star(0.05)
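Both correlation commands can be tried on the auto data set shipped with Stata. The variable lists below are assumptions chosen for illustration; rep78 stands in for an ordinal variable.

```stata
sysuse auto, clear
* Pearson correlations with number of observations, p-values,
* and a star on coefficients significant at the 5% level
pwcorr price mpg weight, obs sig star(0.05)
* Rank-based (Spearman) correlations with the same extras
spearman rep78 price mpg, stats(rho obs p) star(0.05)
```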
9. Cross-sectional data
Differences in Means and medians
Independent two-sample t-test
It helps to know whether there are mean differences in the data that might be interesting to pursue
with multivariate analysis.
There cannot be more than two groups on which you are comparing the mean value: the
grouping variable must be dichotomous.
ttest var, by(groupvar)
sdtest var, by(groupvar)
Mann-Whitney U-test
It is used to examine the rank differences across some characteristic for two groups.
ranksum var, by(groupvar)
Paired t-test and Wilcoxon signed rank test
ttest ind07 == ind08
signrank ind07 = ind08
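The tests above might be run as follows. This is a sketch under assumptions: the two-sample tests use the auto data set with foreign as the dichotomous grouping variable, and the paired tests use the bpwide example data set (bp_before and bp_after) in place of ind07 and ind08.

```stata
* Two independent groups
sysuse auto, clear
ttest mpg, by(foreign)        // difference in means
sdtest mpg, by(foreign)       // equality of variances
ranksum mpg, by(foreign)      // Mann-Whitney U-test

* Two measurements on the same units
sysuse bpwide, clear
ttest bp_before == bp_after   // paired t-test
signrank bp_before = bp_after // Wilcoxon signed rank test
```

Running sdtest first can help decide whether the unequal option of ttest is worth considering.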
10. Cross-sectional data
Theory of regression analysis
What is linear regression analysis?
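• Finding the relationship between a dependent variable and
an independent variable.
Y = α + bx + e
where Y is the dependent variable, x the independent variable, α the intercept,
b the slope and e the error term.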
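A minimal sketch of estimating this equation in Stata, assuming the auto data set with price as Y and mpg as x (both chosen only for illustration):

```stata
sysuse auto, clear
regress price mpg             // estimates α (_cons) and b (mpg)
display "alpha = " _b[_cons]
display "b     = " _b[mpg]
predict yhat, xb              // fitted values α + b*x
predict res, residuals        // estimated error term e
```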
12. Regression diagnostics (Cont’d).
Normality refers to the normal distribution of the error terms
Testing the residuals for normality
Shapiro-Wilk W test
swilk res
Skewness and kurtosis test
sktest res
Testing the normality of a variable
sktest var
tabstat var, s(sk kur)
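The residual variable res used above has to be created after the regression. A sketch, assuming the auto data set and an illustrative model of price on mpg and weight:

```stata
sysuse auto, clear
regress price mpg weight
predict res, residuals        // store the residuals as a new variable
swilk res                     // Shapiro-Wilk W test
sktest res                    // skewness and kurtosis test
* The same checks applied to a single variable
sktest price
tabstat price, statistics(skewness kurtosis)
```

In both tests the null hypothesis is normality, so a small p-value points to non-normal residuals.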
13. Regression diagnostics (Cont’d).
Outliers detection
Outlier detection determines whether a residual
(error = actual - predicted) is an extreme negative or positive value.
Standardized residuals
predict residstd, rstandard
list residstd
If the standardized residuals have values above 3.5 or below -3.5, they are
outliers.
Cook’s D
predict cook, cooksd
list cook if cook > 4/n (where n is the number of observations)
Winsorization
winsor2 var, replace cuts(1 99)
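The outlier checks can be sketched as below, assuming the auto data set and an illustrative model. Because winsor2 is a user-written command that must be installed separately, the last lines show a hand-rolled alternative that caps price at its 1st and 99th percentiles; the names p1, p99 and price_w are hypothetical.

```stata
sysuse auto, clear
regress price mpg weight
predict residstd, rstandard
predict cook, cooksd
* Missing values count as very large in Stata, so exclude them explicitly
list make residstd if abs(residstd) > 3.5 & !missing(residstd)
list make cook if cook > 4/e(N) & !missing(cook)

* Manual winsorization at the 1st and 99th percentiles
_pctile price, percentiles(1 99)
scalar p1  = r(r1)
scalar p99 = r(r2)
generate price_w = min(max(price, p1), p99) if !missing(price)
```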
14. Regression diagnostics (Cont’d).
Heteroskedasticity
Refers to a situation in which the error terms of the model do not have
constant variance. This problem should be addressed because it distorts the
standard errors, which can make significant variables appear statistically insignificant.
Testing the residuals for heteroskedasticity
estat hettest
Solving heteroskedasticity problem
reg var, robust
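A sketch of the test and the remedy, assuming the auto data set and an illustrative model. The test is run after the regression, and its null hypothesis is constant variance.

```stata
sysuse auto, clear
regress price mpg weight
estat hettest                          // Breusch-Pagan / Cook-Weisberg test
* Same coefficients, heteroskedasticity-robust standard errors
regress price mpg weight, vce(robust)
```

Robust standard errors leave the coefficient estimates unchanged; only the standard errors and the resulting test statistics differ.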
15. Regression diagnostics (Cont’d).
Multicollinearity
Refers to a high correlation between two or more independent variables in a
regression model. This problem inflates the standard errors of the affected
coefficients and makes the estimates unstable.
Testing for multicollinearity
estat vif
Solving multicollinearity problem
Centering or standardizing approach
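The following sketch shows the variance inflation factors and one use of centering, assuming the auto data set. Centering mainly helps when the collinearity comes from polynomial or interaction terms, which is the case illustrated here; weight_c and weight_c2 are hypothetical variable names.

```stata
sysuse auto, clear
regress price weight length
estat vif                              // variance inflation factors

* Center weight before building its squared term
summarize weight, meanonly
generate weight_c  = weight - r(mean)
generate weight_c2 = weight_c^2
regress price weight_c weight_c2
estat vif
```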
16. Regression diagnostics (Cont’d).
Model specification
Refers to including all relevant variables and excluding all irrelevant ones.
Testing for model specification
estat ovtest
linktest
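Both specification tests are run after the regression. A sketch, assuming the auto data set and an illustrative model:

```stata
sysuse auto, clear
regress price mpg weight
estat ovtest      // Ramsey RESET test; null: no omitted variables
linktest          // a significant _hatsq suggests misspecification
```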
17. Other regression commands
Logistic Regression
logistic var
Probit Regression is the other main method for analysing binary
dependent variables. Whereas logit (or logistic) regression is based on log
odds, probit uses the cumulative normal probability distribution.
probit var
Poisson Regression is for a count (non-negative integer)
dependent variable
poisson var
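The three commands share the same layout: dependent variable first, then the independent variables. The sketch below assumes the auto data set; foreign is a genuine binary variable, while rep78 is treated as a count purely for illustration.

```stata
sysuse auto, clear
logistic foreign mpg weight   // logistic regression, reports odds ratios
logit foreign mpg weight      // same model, reports coefficients (log odds)
probit foreign mpg weight     // probit regression
poisson rep78 mpg             // Poisson regression for a count outcome
```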
18. Presenting your results
For descriptive and correlation results
Edit > Copy table
Open a blank Word document and press paste
Table > Convert text to table
For regression results
esttab
esttab, se ar2
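esttab is a user-written command from the estout package, so it has to be installed once before use. A sketch, assuming the auto data set, two illustrative models, and a hypothetical output file named results.rtf:

```stata
* ssc install estout          // run once to install esttab and eststo
sysuse auto, clear
eststo clear
eststo: regress price mpg
eststo: regress price mpg weight
esttab, se ar2                            // standard errors and adjusted R-squared
esttab using results.rtf, se ar2 replace  // file that can be opened in Word
```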
19. • The difference between cross-sectional, time
series and panel data
• Why panel?
• More observations mean more information
• The panel structure of the data allows better use
of the data
21. • Data need to be set as panel in Stata (time
and individual dimensions)
• Summary statistics for panel, xtsum, xtdes …
• Fixed effects
• Random effects models
• Pooled OLS
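The panel steps listed above might look as follows. This is a sketch under assumptions: it uses the nlswork example data set (downloaded by webuse, which needs an internet connection), with idcode as the individual dimension, year as the time dimension, and an illustrative wage model.

```stata
webuse nlswork, clear
xtset idcode year                    // declare individual and time dimensions
xtdescribe                           // participation pattern of the panel
xtsum ln_wage age tenure             // overall, between and within variation
regress ln_wage age tenure           // pooled OLS
xtreg ln_wage age tenure, fe         // fixed effects
xtreg ln_wage age tenure, re         // random effects
```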
22. • Hausman test
• Breusch and Pagan Lagrangian Multiplier (LM)
test
• Modified Wald test for groupwise heteroskedasticity
• Wooldridge test for autocorrelation in panel data
• Pesaran's test of cross sectional dependence
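The first two tests in this list are built into Stata and can be sketched as below, again assuming the nlswork example data set and an illustrative model. The remaining three tests are, as far as I know, usually run with the user-written commands xttest3, xtserial and xtcsd, which must be installed separately and are therefore left out of the sketch.

```stata
webuse nlswork, clear
xtset idcode year
xtreg ln_wage age tenure, fe
estimates store fe
xtreg ln_wage age tenure, re
estimates store re
xttest0                 // Breusch and Pagan LM test: random effects vs pooled OLS
hausman fe re           // Hausman test: fixed vs random effects
```

xttest0 is placed directly after the random-effects regression because it uses the most recent estimation results.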
24. Suggested readings
Gujarati & Porter (2010) “Essentials of Econometrics”, McGraw-Hill,
New York.
Cameron & Trivedi (2009) “Microeconometrics Using Stata”, A Stata
Press Publication, StataCorp LP, College Station, Texas, USA.
Pevalin & Robson (2009) “The Stata Survival Manual”, Two Penn
Plaza, New York, USA.
Wooldridge (2003) “Introductory Econometrics: A Modern Approach”
(2nd Ed.), Thomson South-Western, USA.