2. Outline
• The Stata Platform
• Storing Commands and Output
• Examining dataset
• Descriptive Statistics
• Creating and Modifying Variables
• Advanced Descriptive Statistics
• Presenting Data with Graph
• Normality and Outlier
• Statistical Tests
• Linear Regression
• Data Management
8. Housekeeping Commands
• The Global macros
– Here we use it to store file locations (but it has many other
uses)
• We can define the path of our file using
global mydata "D:...Data"
• Whenever we need to refer to this path we can write
$mydata
9. Housekeeping Commands
• The cd (Change Directory) command
– On its own, identifies the current working
directory (this is the Windows behaviour; the pwd command shows the working directory on every platform)
– Followed by a path, changes the current working
directory to the one on the path
cd "D:...Data"
or
cd "$mydata"
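The two housekeeping commands can be combined as in the following minimal sketch. Because the real data folder is not known here, the sketch assumes the current working directory (read from `c(pwd)`) as a stand-in for the path.

```stata
* Sketch only: the folder is an assumption (Stata's current directory),
* standing in for the real data path.
global mydata "`c(pwd)'"      // store a folder path in a global macro
display "$mydata"             // show what the macro expands to
cd "$mydata"                  // change to that folder
pwd                           // report the current working directory
```

Defining the path once at the top of a do-file means only one line has to change when the project moves to another folder.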
10. Storing Commands and Output
• The following topics are covered:
– Using the Do-file Editor
– log using
– log off
– log on
– log close
– set logtype to move tables from Stata to Word and
Excel
11. Storing Commands and Output
• Using the Do-file Editor
– The Do-file Editor allows you to store a program
(a set of commands),
– It makes checking and fixing errors easier,
– It allows you to run the commands later,
– It lets you share your procedures with
collaborators or reviewers, and
– It allows you to collaborate with others on the
analysis.
12. Storing Commands and Output
• Any time you are running more than 10
commands, it is easier and safer to use a Do-
file to store the commands
• To open the Do-file Editor, you can
– click on Window/Do-file Editor or
– click on the icon on the Tool Bar.
13. Storing Commands and Output
• Keyboard shortcuts are quicker to use than
the buttons. The most useful ones are:
• Control-O Open file
• Control-S Save file
• Control-C Copy
• Control-X Cut
• Control-V Paste
• Control-Z Undo
• Control-F Find
• Control-H Find and Replace
14. Storing Commands and Output
• Adding comments to a do-file
– To add a comment on a single line
* We can put an asterisk and write the comment
– To add a comment in multiple lines
/* open a bracket like this
and end it by closing the bracket like this */
–To add a comment after a command
Command // write the comment after 2 slashes
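A short do-file fragment showing the three comment styles together may help. It is only a sketch: it assumes the auto dataset that ships with Stata (loaded with sysuse) as a stand-in for the training data, and it must be run from a do-file because // comments are not recognised in the Command window.

```stata
* A single-line comment starts with an asterisk
sysuse auto, clear        // example dataset shipped with Stata (assumption)

/* A comment that runs over
   several lines is wrapped like this */
summarize price           // a comment placed after a command
```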
15. Storing Commands and Output
• To run the commands in a Do-file,
– you can click on the Do button or
– click on Tools/Do or
– Use Ctrl+D
– If you want to run one or just a few commands
rather than the whole file, mark the commands and
click on the Do button
16. Storing Commands and Output
• Saving the Output
– The Stata Results window does not keep all the output
you generate.
– When it is full, it begins to delete the oldest results as
you add new results.
– Thus, we need to use a log to save the output
17. Storing Commands and Output
• log using
– This command creates a file with a copy of all the
commands and output from Stata. The syntax is:
log using filename [, append replace [ text | smcl ] ]
• append adds the output to an existing file
• replace replaces an existing file with the output
• text tells Stata to create the log file in text
(ASCII) format
• smcl tells Stata to create the log file in
SMCL format
18. Storing Commands and Output
• Here are some examples:
– log using temp22
saves output to a new file called temp22
– log using temp22, replace
saves output to an existing file, temp22, replacing its contents
– log using temp22, append
saves output to an existing file, temp22, adding to its contents
– log using "$mydata/myfile", replace
saves output in the specified file in the specified folder
19. Storing Commands and Output
• log off
– This command temporarily turns off the logging of
output,
• log on
– This command is used to restart the logging,
• log close
– This command is used to turn off the logging and
save the file.
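The logging commands fit together as in the sketch below. The file name session_demo is hypothetical, and the auto dataset shipped with Stata is assumed in place of the training data.

```stata
capture log close                      // close any log that is already open
log using session_demo, replace text   // start a plain-text log (session_demo.log, hypothetical name)
sysuse auto, clear                     // example data (assumption)
summarize price                        // written to the log
log off                                // pause logging
describe                               // not written to the log
log on                                 // resume logging
tabulate foreign                       // written to the log
log close                              // stop logging and save the file
```

Starting a do-file with capture log close avoids the error Stata raises when a log is still open from an earlier run.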
20. Storing Commands and Output
• set logtype text
– This command tells Stata to always save the log
files in text (ASCII) format
• set logtype smcl
– This command tells Stata to always save log files in
SMCL format.
21. Examining dataset
• clear
– The clear command removes all data, variables, and
labels from memory to get ready to use a new
data file
– You can clear memory using the clear command or
by using it as part of the use command
– This command does not delete any data saved to
the hard-drive
22. Examining dataset
• set memory (Stata 11 and earlier; Stata 12 and later manage memory automatically and ignore this command)
– First, you can check how much memory is
allocated to hold your data using the memory
command
– By default we have 11MB free for reading in a data
file.
– Whenever we try to read a data file bigger than the
free memory, we get the error message:
no room to add more observations
r(901);
23. Examining dataset
– In this case we have to allocate more memory, say
25MB (if 25MB is sufficient for the current file), with the
set memory command before trying to use our file.
set memory 25m
– Now that we have allocated enough memory, we will
be able to read bigger files, provided that they fit within
the allocated memory
– If we want to allocate 25m (25 megabytes) every time
we start Stata, we can type:
set memory 25m, permanently
24. Examining dataset
• use
– This command opens an existing Stata data file.
• The syntax is:
use filename [, clear ] opens the file ‘filename’
use [varlist] [if exp] [in range] using filename [, clear ]
opens selected parts of file
– If there is no path, Stata assumes the file is in the current folder.
– You can use a path name such as: use C:...ERHScons1999
– If the path name has spaces, you must use double quotes:
use "d:\my data\ERHScons1999"
25. Examining dataset
• Relational and logical operators used in Stata
~     not
==    equal
~=    not equal
!=    not equal
>     greater than
>=    greater than or equal
<     less than
<=    less than or equal
&     and
|     or
26. Examining dataset
Here are some examples of the use command:
• use ERHScons1999                       opens the file ERHScons1999.dta for analysis
• use ERHScons1999 if q1a == 1           opens data from region 1
• use ERHScons1999 in 5/25               opens records 5 through 25 of the file
• use hhsize cons using ERHScons1999     opens 2 variables from the ERHScons1999 file
• use C:\training\ERHScons1999           opens the file ERHScons1999.dta in the specified folder
• use "$mydata/ERHScons1999"             use quotation marks if the path contains spaces
• use ERHScons1999, clear                clears memory before opening the new file
27. Examining dataset
• save
– The save command will save the dataset as a .dta file under the
name you choose.
Open a subset of a dataset (for region 1 = Tigray only)
use erhscons1999 if q1a==1, clear
Save this data as a new file with the name tigray
save tigray, replace
• The replace option allows you to save a changed file to the
disk, replacing the original file. Stata is worried that you will
accidentally overwrite your data file. You need to use the
replace option to tell Stata that you know that the file
exists and you want to replace it.
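The use and save variants above can be tried without the training file. The sketch below assumes the auto dataset shipped with Stata; auto_copy and foreign_cars are hypothetical file names written to the current folder.

```stata
sysuse auto, clear                              // example data shipped with Stata (assumption)
save auto_copy, replace                         // auto_copy.dta is a hypothetical file name
use auto_copy if foreign == 1, clear            // open only the rows that meet a condition
count                                           // number of rows now in memory
save foreign_cars, replace                      // save the subset under a new (hypothetical) name
use make price using auto_copy in 1/10, clear   // two variables, first 10 observations
describe
```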
28. Examining dataset
• Open the training dataset
use ERHScons1999, clear
• edit
– This command is used to open the Data Editor window,
which allows us to view observations as a spreadsheet
– You can change the data using the Data Editor window, but
it is not recommended to edit data this way
– It is better to correct errors in the data using a Do-file
program that can be saved
29. Examining dataset
• browse
– This window is exactly like the Data Editor window,
except that you can't change the data in this case
• describe
– This command provides a brief description of the data
file. You can use "des" or "d" as shorthand for
describe.
– The output includes:
• the number of variables
• the number of observations (records)
• the size of the file
• the list of variables and their characteristics
31. Examining dataset
• list
– This command lists values of variables in data set.
The syntax is:
list [varlist] [if exp] [in range]
• examples:
– list                        lists the entire dataset
– list in 1/10                lists observations 1 through 10
– list hhsize q1a food        lists selected variables
– list hhsize sex in 1/20     lists observations 1-20 for selected variables
– list if q1a < 6             lists cases where region is 1 through 5
32. Examining dataset
• if
– This qualifier is used to select certain records when
carrying out a command
command if exp
Examples:
– list hhid q1a food if food > 1200          lists data if food is above 1200
– tab q1a if cons > 1000 & cons < 2000       frequency table of region if consumption is in range
– summarize food if q1a==3 | q1a==4          statistics on food consumption for regions 3 and 4
– browse hhid q1a food if food >= 1200       browse data if food consumption is 1200 or above
• Note that "if" expressions always test equality with ==, not a single =
33. Examining dataset
• in
– We have also used in to select records based on
the observation number.
– The syntax is:
command in range
For example:
• list in 10               list observation number 10
• summarize in 10/20       summarize observations 10-20
• l in -10/-1              list the last 10 observations
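The if and in qualifiers can be explored with the sketch below, which assumes the auto dataset shipped with Stata rather than the training file; the variable names belong to that dataset.

```stata
sysuse auto, clear                          // example data (assumption)
list make price in 1/5                      // first five observations
list make price if price > 10000            // rows that satisfy a condition
summarize mpg if rep78 == 4 | rep78 == 5    // "or" combines two equality tests
count if foreign == 1 & mpg >= 25           // "and" requires both conditions
list make in -3/-1                          // last three observations
```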
34. Examining dataset
• codebook
– The codebook command is a great tool for getting
a quick overview of the variables in the data file.
– It produces a kind of electronic codebook from
the data file, displaying information about
variables' names, labels and values
. codebook
sexh                                        Sex of household head
----------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  sexhh

                 range:  [0,1]                  units:  1
         unique values:  2                  missing .:  0/1452

            tabulation:  Freq.   Numeric  Label
                           400         0  Female
                          1052         1  Male
35. Examining dataset
• inspect
– It is another useful command for getting a quick
overview of a data file.
– The inspect command displays information about the
values of variables and is useful for checking data
accuracy
. inspect sexh
sexh:  Sex of household head           Number of Observations
----------------------------     ---------------------------------
                                   Total    Integers   Nonintegers
|  #             Negative              -           -             -
|  #             Zero                400         400             -
|  #             Positive           1052        1052             -
|  #                               -----       -----         -----
|  #   #         Total              1452        1452             -
|  #   #         Missing               -
+----------------------            -----
0                     1             1452
(2 unique values)
sexh is labeled and all values are documented in the label.
36. Examining dataset
• count
– The count command shows the number of observations
that satisfy the if condition. If no
condition is specified, count displays the
number of observations in the data.
count
1452
count if q1a==3
466
37. Examining dataset
Common Stata Syntax
• Stata commands follow the syntax:
[by varlist1:] command [varlist2] [if exp] [in
range] [weight] [, options]
• Items inside the square brackets are either
optional or not available for every command.
• This syntax applies to most Stata commands
38. Descriptive Statistics
• tabulate, tab1, tab2
–These are three related commands that
produce frequency tables for discrete
variables.
–They can produce one-way or two-way
frequency tables
39. Descriptive Statistics
• tabulate or tab    produces a frequency table for one or two variables
• tab1               produces a one-way frequency table for each variable in the variable list
• tab2               produces all possible two-variable tables from the list of variables
40. Descriptive Statistics
You can use several options with these commands:
• all gives all the tests of association for two-way
tables
• cell gives the overall percentage for two-way
tables
• column gives column percentages for two-way
tables
• row gives row percentages for two-way tables
• nofreq suppresses printing the frequencies.
• chi2 provides the chi squared test for two-way
tables
41. Descriptive Statistics
Some examples of the tabulate commands are:
• tabulate q1a                         produces a table of frequencies by region
• tabulate q1a sexh                    produces a cross-tab of frequencies by region and sex of head
• tabulate q1a hhsize, row             produces a cross-tab by region and hhsize with row percentages
• tabulate sexh hhsize, cell nofreq    produces a cross-tab of overall percentages by sex and hhsize
• tab1 q1a q1b hhsize                  produces three tables, a frequency table for each variable
• tab2 q1a poor sexh                   produces three tables, a cross-tab of each pair of variables
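The same options can be tried on any small dataset. The following sketch assumes the auto dataset shipped with Stata, with foreign and rep78 playing the role of the categorical variables.

```stata
sysuse auto, clear                       // example data (assumption)
tabulate foreign                         // one-way frequency table
tabulate rep78 foreign, row              // cross-tab with row percentages
tabulate rep78 foreign, cell nofreq      // overall percentages only
tabulate rep78 foreign, chi2             // adds the chi-squared test
tab1 foreign rep78                       // one frequency table per variable
```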
42. Descriptive Statistics
• summarize
– The summarize command produces statistics on continuous variables like age,
food, cons, and hhsize. The syntax looks like this:
summarize [varlist] [if exp] [in range] [, detail]
By default, it produces the following statistics:
• Number of observations
• Average (or mean)
• Standard deviation
• Minimum
• Maximum
If you specify “detail” Stata gives you additional statistics, such as
• skewness,
• kurtosis,
• the four smallest values
• the four largest values
• various percentiles.
43. Descriptive Statistics
• Here are some examples:
• summarize                           gives statistics on all variables
• summarize hhsize food               gives statistics on selected variables
• summarize hhsize cons if q1a==3     gives statistics on two variables for one region
44. Descriptive Statistics
• bysort
– This prefix goes before a command and asks Stata
to repeat the command for each value of a variable.
The general syntax is:
bysort varlist: command
• Example:
– bysort sex: sum rconsae      for each sex of hh head, gives statistics
on real per capita consumption
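A minimal sketch of summarize, its detail option, and the bysort prefix, assuming the auto dataset shipped with Stata in place of the training data:

```stata
sysuse auto, clear                  // example data (assumption)
summarize price mpg                 // N, mean, sd, min, max
summarize price, detail             // adds percentiles, skewness, kurtosis
display "median price = " r(p50)    // results are left behind in r()
bysort foreign: summarize price     // repeat the command for each group
```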
45. Descriptive Statistics
• help
– The help command gives you information about any
Stata command or topic
help [command]
For example,
• help tabulate gives a description of
the tabulate command
• help summarize gives a description of the
summarize command
46. Creating New Variables
• We have seen how to explore the data using
existing variables so far.
• Now we will discuss how to create new
variables.
• When new variables are created, they are in
memory and they will appear in the Data
Browser,
– but they will not be saved on the hard-disk unless
you use the save command
47. Creating New Variables
• generate
– This command is used to create a new variable. It
is similar to “compute” in SPSS.
• The syntax is;
generate newvar = exp [if exp]
where “exp“ is an expression like
“food/hhsize” or
“20*cons”
48. Creating New Variables
• The command cannot be used to modify an
existing variable
• You can use “gen“ or “g” as an abbreviation
for “generate“
• If the expression is an equality or inequality,
the variable will take the values 0 if the
expression is false and 1 if it is true
• If you use “if“, the new variable will have
missing values when the “if“ statement is false
49. Creating New Variables
• For example,
– gen age2 = age*age
• create age squared variable
– gen yield = outputkg/area if area>0
• create new yield variable if area is positive
– gen price = value/quant if quant>0
• create new price variable if quant is positive
– gen smhh = (hhsize<4)
• creates a dummy variable equal to 1 for smaller
households (fewer than 4 members)
50. Creating New Variables
• replace
– This command is used to change the values of
an existing variable.
• The syntax is the same as for generate:
replace oldvar = exp [if exp] [in range]
51. Creating New Variables
• For example,
replace cons=. if cons<0
replaces negative consumption with missing
value
replace price = avgprice if price > 100000
replaces high values with an average price
replace age = 25 in 1007
sets age to 25 in observation 1007
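The generate and replace rules described above are sketched below. The example assumes the auto dataset shipped with Stata; the new variable names (mpg2, ppw, cheap) are illustrative only.

```stata
sysuse auto, clear                            // example data (assumption)
generate mpg2 = mpg*mpg                       // squared term
generate ppw = price/weight if weight > 0     // missing wherever the condition is false
generate byte cheap = (price < 5000)          // 1 if the expression is true, 0 if false
replace cheap = . if missing(price)           // keep missing prices out of the 0 group
replace mpg2 = . in 1                         // change a single observation
```

Because Stata treats missing values as larger than any number, a dummy built from an inequality should be checked against missing values as in the fourth line.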
52. Creating New Variables
• tabulate … generate
– This command is useful for creating a set of
dummy variables (variables with a value of 0 or 1)
depending on the value of an existing categorical
variable.
• The syntax is:
tabulate oldvariable, generate(newvariable)
53. Creating New Variables
• Some functions that can be used in expressions:
abs(x)         computes the absolute value of x
exp(x)         calculates e to the x power
ln(x)          computes the natural logarithm of x
log(x)         is a synonym for ln(x), the natural logarithm
log10(x)       computes the log base 10 of x
sqrt(x)        computes the square root of x
invnorm(p)     provides the inverse cumulative normal; invnorm(norm(z)) = z (newer Stata versions name this invnormal())
normden(z)     provides the standard normal density (newer Stata versions name this normalden())
normden(z,s)   provides the normal density; normden(z,s) = normden(z)/s if s>0 and s is not missing, otherwise the result is missing
norm(z)        provides the cumulative standard normal (newer Stata versions name this normal())
group(x)       creates a categorical variable that divides the data into x as nearly equal-sized subsamples as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data.
int(x)         gives the integer obtained by truncating x
round(x,y)     gives x rounded into units of y
54. Creating New Variables
tab q1a, gen(region)
• This creates 6 new variables:
region1=1 if q1a=1 and 0 otherwise
region2 =1 if q1a =3 and 0 otherwise
……
region6=1 if q1a =8 and 0 otherwise
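As a sketch of the same idea on data that ships with Stata (an assumption; the auto dataset replaces the training file), rep78 takes the values 1 to 5, so five dummies are created:

```stata
sysuse auto, clear                  // example data (assumption)
tabulate rep78, generate(rep)       // creates rep1 ... rep5, one dummy per category
list rep78 rep1-rep5 in 1/5         // each dummy is 1 for its own category, 0 otherwise
```

Observations with a missing value in the original variable get missing values in every dummy.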
55. Creating New Variables
• egen
– This is an extended version of “generate”[extended
generate] to create a new variable by aggregating the
existing data.
• The syntax is:
egen newvar = fcn(arguments) [if exp] [in range] [, by(varlist)]
56. Creating New Variables
• count()      number of non-missing values
• diff()       compares variables, 1 if different, 0 otherwise
• fill()       fill with a pattern
• group()      creates a group id from a list of variables
• iqr()        interquartile range
• ma()         moving average
• max()        maximum value
• mean()       mean
• median()     median
• min()        minimum value
• pctile()     percentile
• rank()       rank
• rmean()      mean across variables (rowmean() in newer versions)
• sd()         standard deviation
• std()        standardize variables
• sum()        sums (total() in newer versions)
57. Creating New Variables
• egen avg = mean(cons)
creates variable of average consumption
over entire sample
• egen avg2 = median(cons), by(sex)
creates variable of median consumption
for each sex
• egen regprod = sum(cons), by(q1a)
creates variable of total consumption for
each region
58. Creating New Variables
• Exercise:
• We want to know which households have
expenditure (cons) above the village average.
• That is, create a dummy (1 for those who
consume above the village/peasant
association average and 0 otherwise)
59. Creating New Variables
• egen avecon=mean(cons), by( q1c)
• gen highavecon=(cons> avecon & cons!=.)
• list hhid q1c cons avecon highavecon in 650/675
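The same two-step pattern (a group mean with egen, then a comparison with generate) is sketched here on the auto dataset shipped with Stata, which is an assumption standing in for the training data; foreign plays the role of the grouping variable.

```stata
sysuse auto, clear                                        // example data (assumption)
egen avgprice = mean(price), by(foreign)                  // group mean repeated on every row of the group
generate byte above = (price > avgprice) if !missing(price)
list make foreign price avgprice above in 1/10
```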
60. Creating New Variables
• Arithmetic
+ addition
- subtraction
* multiplication
/ division
^ power
• Logical
~ not
| or
& and
• Relational
> greater than
< less than
>= greater than or equal
<= less than or equal
== equal
~= not equal
!= not equal
61. Creating New Variables
• Here are some examples to illustrate the use of these
operators. Suppose you want to create a
– dummy variable indicating households in the
Amhara region.
– One way to do it is to run:
generate AmD = 0
replace AmD = 1 if q1a==3
– Or you can get exactly the same result with just:
generate AmD2 = (q1a==3)
compare AmD AmD2
62. Creating New Variables
• For example, generate a dummy that would
identify observations with female household
heads in Dodota wereda.
gen DDfemale = 0
replace DDfemale = 1 if q1b==9 & sexh==0
or an easier way to do this would be:
gen DDfemale2 = (q1b==9 & sexh==0)
63. Creating New Variables
• recode
– This command changes the values of a categorical
variable according to the rules specified.
• The syntax is:
recode varname old#=new# old#=new# [if exp] [in range]
64. Creating New Variables
• Notice that you can use some special symbols
in the rules:
* means all other values
. means missing values
x/y means all values from x to y
x y means x and y
• For example, to recode region values 8 and 9 to 7:
recode q1a 8 9 = 7
65. Creating New Variables
• Here are some examples:
• recode x 1=2 changes all values of x=1 to x= 2
• recode x 1=2 3=4 changes 1 to 2 and 3 to 4
• recode x 1=2 2=1 exchanges the values 1 and 2 in x
• recode x 1=2 *=3 changes 1 in x to 2 and all other
values to 3
• recode x 1/5=2 changes 1 through 5 in x to 2
• recode x 1 3 4 5 = 6 changes 1, 3, 4 and 5 to 6
• recode x .=9 changes missing to 9
• recode x 9=. changes 9 to missing
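A hedged sketch of recode on the auto dataset shipped with Stata (an assumption). It uses the parenthesised rule syntax and the generate() option so that the original variable is left untouched; rep3 is an illustrative name.

```stata
sysuse auto, clear                                            // example data (assumption)
recode rep78 (1 2 = 1) (3 = 2) (4/5 = 3), generate(rep3)      // three groups, stored in a new variable
tabulate rep78 rep3, missing                                  // check the mapping
```

Writing the result to a new variable makes it easy to cross-tabulate old against new values and confirm that every rule did what was intended.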
66. Creating New Variables
• xtile
– This command creates a new variable that
indicates which category a record falls into, when
the sample is sorted by an existing variable and
divided into “n” groups of equal size.
• Example: xtile can be used to create a variable
that indicates which income quintile a
household belongs to
67. Creating New Variables
• The syntax is:
xtile newvar = variable [if exp] [in range] , nq(#)
– where newvar is the new categorical variable
created; variable is the existing variable used to
create the quantiles (e.g. income, farm size); # is the
number of categories (e.g. 5 for quintiles,
3 for terciles)
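For illustration, quintiles of price can be built on the auto dataset shipped with Stata (an assumption; pricequint is an illustrative name):

```stata
sysuse auto, clear                      // example data (assumption)
xtile pricequint = price, nq(5)         // 1 = cheapest fifth, 5 = most expensive fifth
tabulate pricequint                     // group sizes are as equal as ties allow
tabstat price, by(pricequint) statistics(min max)
```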
69. Modifying Variables
• We begin with an explanation of how to label
data in Stata. Then we see how to format
variables.
– rename variable
– label variable
– label define
– label values
– format variable
70. Modifying Variables
• rename variables
– This command is used to give a variable a
new name.
– The syntax is
rename old_variable new_variable
• Example: Generate a dummy for the region
variable and rename the new dummy
variables accordingly
71. Modifying Variables
[Diagram: one set of value labels (Region) can serve several variables: current residence, place of birth, and migrated to.]
• label define: define the labels (saves the label definitions)
• label values: attach the defined labels to a variable
72. Modifying Variables
• label define
– This command gives a name to a set of value
labels. For example, instead of numbering the
regions, we can assign a label to each region.
• The syntax is:
label define lblname # "label" # "label" # "label"
[, add modify]
73. Modifying Variables
• label values
– This command attaches named set of value labels
to a categorical variable.
• The syntax is:
label values varname [lblname] [, nofix]
74. Modifying Variables
label define reg 1 "Tigray" 3 "Amhara" 4 "Oromia" 7 "SNNP", modify
label values q1a reg
• Some additional commands that may be
useful in labeling
– label dir to request a list of existing label names
– label list to request a list of all the existing value
labels
– label drop to delete one or more labels
– label save using to save label definitions as a Do-file
– label data to give a label to a data file
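The labelling steps can be sketched end to end as follows. The example assumes the auto dataset shipped with Stata; highmpg and mpglbl are illustrative names, not part of the training data.

```stata
sysuse auto, clear                                        // example data (assumption)
generate byte highmpg = (mpg >= 25) if !missing(mpg)      // a 0/1 variable to label
label variable highmpg "Fuel economy group"               // variable label
label define mpglbl 0 "Below 25 mpg" 1 "25 mpg or more"   // define the set of value labels
label values highmpg mpglbl                               // attach the set to the variable
label list mpglbl                                         // show the definitions
tabulate highmpg                                          // tables now show the labels
```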
75. Advanced Descriptive Statistics
• tabulate … summarize
– This command creates one- and two-way tables
that summarize continuous variables. With the
summarize() option, each cell shows means and other
statistics of a continuous variable.
• The syntax is:
tabulate varname1 [varname2] [if exp] [in range],
summarize(varname3) [options]
• You can specify which statistics to show with the options "means",
"standard" and "freq"
76. Advanced Descriptive Statistics
• Some examples:
• tab q1a, sum(cons)           gives the mean, std deviation, and frequency of per capita expenditure for each region
• tab q1b, sum(cons) mean      gives the mean consumption for each wereda
• tab q1a sexh, sum(food)      gives the mean, std deviation, and frequency of food consumption in each cell of region by sex of hh head
77. Advanced Descriptive Statistics
• tabstat
– This command gives summary statistics for a set of
continuous variable for each value of a categorical
variable.
• The syntax is:
tabstat varlist [if exp] [in range] , stat(statname [...]) by(varname)
• where
– varlist is a list of continuous variables
– statname is a type of statistic
– varname is a categorical variable
78. Advanced Descriptive Statistics
Some facts about this command:
– The default statistic is the mean.
– Optional statistics include mean, sum, max,
min, range, sd (standard deviation), var (variance),
skewness, kurtosis, median, and pn (nth percentile).
– Without the by() option, tabstat is like "summarize" except
that it allows you to specify the list of statistics to be
displayed.
– With the by() option, tabstat is like "tabulate … summarize"
except that tabstat is more flexible in the statistics and
format
79. Advanced Descriptive Statistics
• Examples
– tabstat food hhsize, stats(mean max min)     gives mean, max, and min of food & hhsize
– tabstat food hhsize, by(q1a)                 gives mean of two variables for each region
– tabstat food, stats(median) by(q1a)          gives the median food consumption for each region
• The tabstat command displays summary statistics for
a series of numeric variables in a single table.
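A minimal sketch comparing tabulate with summarize() and tabstat, assuming the auto dataset shipped with Stata in place of the training data:

```stata
sysuse auto, clear                                          // example data (assumption)
tabulate foreign, summarize(price)                          // mean, sd and frequency per group
tabulate foreign, summarize(price) means                    // means only
tabstat price mpg, statistics(mean median sd)               // chosen statistics, whole sample
tabstat price mpg, statistics(mean median sd) by(foreign)   // the same statistics per group
```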
80. Advanced Descriptive Statistics
• table
– This command creates a wide variety of tables. It is
probably the most flexible and useful of all the
table commands in Stata.
• The syntax (in Stata 16 and earlier; the command was redesigned in Stata 17) is:
table rowvar [colvar] [if exp] [in range], c(clist) [row col]
• where
– rowvar is the categorical row variable
– colvar is the categorical column variable
– clist is a list of statistics and variables
– row is an option to include a summary row
– col is an option to include a summary column
81. Advanced Descriptive Statistics
• Some useful facts about this command:
– The default statistic is the frequency.
– Optional statistics are mean, sd, sum, rawsum
(unweighted), count, max, min, median, and pn
(nth percentile).
– The c() option is short for contents() of each cell.
– Like tab, it can be used to create one- and two-way
frequency tables, but table cannot do
percentages
82. Advanced Descriptive Statistics
• Useful facts (cont.) :
– Like tab…sum, it can be used to calculate basic stats for
each value of a categorical variable
– Its advantage over tab…sum is that it can do more
statistics and it can take more than one continuous
variable
– Like tabstat, it can be used to calculate advanced stats for
each value of a categorical variable
– Its advantage over tabstat is that it can use two (and
more) way tables, but its disadvantage is that it has fewer
statistics.
83. Advanced Descriptive Statistics
• Here are some examples:
– table q1a, row                                 table of frequencies by region with a total row
– table q1a, c(mean cons)                        table of average consumption by region
– table q1a, c(mean food sd food median food)    table of food consumption statistics by region
– table q1a, c(mean cons) format(%9.2f)          table of average consumption by region with two decimal places
– table q1a sexh, c(mean cons)                   table of average consumption by region and sex
– table q1a sexh, c(mean cons mean food)         table of avg consumption & food consumption by region & sex
84. Presenting Data with Graph
• The commands that draw graphs are
graph twoway     scatterplots, line plots, etc.
graph matrix     scatterplot matrices
graph bar        bar charts
graph dot        dot charts
graph box        box-and-whisker plots
graph pie        pie charts
85. Presenting Data with Graph
• Examples
graph twoway scatter cons food
• We can show the regression line predicting
cons from food using the lfit plot type.
twoway lfit cons food
• The two graphs can be overlaid like this
twoway (scatter cons hhsize) (lfit cons hhsize)
twoway (scatter cons food) (lfit cons food)
86. Presenting Data with Graph
• Labeling graphs
scatter var1 var2, title("title") subtitle("subtitle")
xtitle("xtitle") ytitle("ytitle") note("note")
• Example
scatter ageh cons , title("title") subtitle("subtitle")
xtitle("xtitle") ytitle("ytitle") note("note")
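The overlay and the labelling options can be combined in one command, as in this sketch. It assumes the auto dataset shipped with Stata; the titles and the graph name fit_demo are illustrative, and the /// line continuation requires running the lines from a do-file.

```stata
sysuse auto, clear                                   // example data (assumption)
twoway (scatter price weight) (lfit price weight),   ///
    title("Price and weight")                        ///
    subtitle("Scatter with fitted line")             ///
    xtitle("Weight (lbs.)") ytitle("Price")          ///
    note("Data: auto.dta")                           ///
    name(fit_demo, replace)                          // hypothetical graph name
```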
87. Normality and Outlier
• skewness and kurtosis
sum rconspc
sum rconspc, detail
• check normality of a variable visually by
looking at some basic graphs
histogram rconspc
histogram rconspc, normal
histogram rconspc, normal bin(100)
89. Normality and Outlier
• graph box draws vertical box plots
graph box rconspc, by(sexh)
– the y axis is numerical, and the x axis is categorical
– the lower and upper bounds of the box are the
25th and 75th percentiles
– the line within the box is the median
– the whiskers end at the lower and upper adjacent values (the most extreme values within 1.5 times the interquartile range of the box); points beyond them are plotted individually as outside values
• If rconspc is normal, the median would be in the
center of the box and the ends of the whiskers would be
roughly equidistant from the box
90. Normality and Outlier
• The kdensity command with the normal option
kdensity rconspc, normal
– draws a kernel density graph of the variable with a normal distribution
superimposed on the graph
– useful in verifying that a variable (or the residuals from a regression) is normally
distributed
• The pnorm command produces a P-P plot
pnorm rconspc
– It should be approximately linear if the variable follows
a normal distribution
91. Normality and Outlier
• The qnorm command plots the quantiles of a variable
against the quantiles of a normal distribution
qnorm rconspc
– The closer the Q-Q plot is to the 45 degree
line, the closer the variable is to a normal distribution
• Both the P-P and Q-Q plots indicate that rconspc is not normal, with
a long tail to the right
• The qnorm plot is more sensitive to deviations from normality
in the tails of the distribution
• The pnorm plot is more sensitive to deviations near the mean
of the distribution
92. Normality and Outlier
• Dealing with outliers
– We have the following options when we have
outliers
• delete them from the analysis
• use measures that are not sensitive to them, such as
the median instead of the mean
• transform the data to be more normal
• replace them by imputation
93. Normality and Outlier
/* Calculate number of standard deviations from median by sex of hh head */
egen median=median(rconspc), by (sexh)
egen sd=sd(rconspc), by (sexh)
*generate the ratio of the deviation from the median to the standard deviation
gen ratio=abs((rconspc-median)/sd)
94. Normality and Outlier
* generate an outlier dummy equal to 1 if the ratio is above 3
gen outlier=1 if ratio>3 & ratio~=.
replace outlier=0 if outlier==. & ratio~=.
tabulate outlier, missing
table sexh outlier, contents(mean rconspc) row col missing
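The outlier-flagging steps from the last two slides are collected in the sketch below. It assumes the auto dataset shipped with Stata, with price in place of rconspc and foreign in place of sexh; the variable names med, sdev, ratio and outlier are illustrative.

```stata
sysuse auto, clear                                  // example data (assumption)
egen med  = median(price), by(foreign)              // group median
egen sdev = sd(price), by(foreign)                  // group standard deviation
generate ratio = abs((price - med)/sdev)            // distance from the median in sd units
generate byte outlier = (ratio > 3) if !missing(ratio)
tabulate outlier, missing                           // how many observations are flagged
```

Building the dummy in one line with an if qualifier gives the same result as the generate-then-replace pair and keeps missing ratios missing.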
95. Normality and Outlier
• Listwise deletion
histogram rconspc if outlier==0, normal
• Data transformation
– a log transformation
gen lnrconspc=ln(rconspc)
histogram lnrconspc if rconspc~=., normal
• Imputation
– First the analyst estimates a regression model in which the
dependent variable has missing values
– In the second step, the estimated regression coefficients
are used to predict (impute) missing values of that variable
96. Normality and Outlier
* Set outliers to missing
replace rconspc=. if outlier==1
regress lnrconspc i.q1a i.sexh i.poor hhsize ageh, robust
predict yhat
replace lnrconspc=yhat if rconspc==.
• Or
xi: impute lnrconspc i.q1a i.sexh i.poor hhsize ageh, gen(imputed)
97. Statistical Tests
• compare
– The compare command is an easy way to check if
two variables are the same
compare lnrconspc imputed
• correlate command
– The correlate command displays a matrix of
Pearson correlations for the variables listed.
correlate cons hhsize
correlate cons hhsize, means
pwcorr cons hhsize, sig
98. Statistical Tests
• ttest command
– If, for example, we would like to test whether the mean of hhsize
equals 6 using a one-sample t-test, we use the ttest
command:
ttest hhsize=6
• We can also test whether cons and food have the same
mean (a paired t-test)
ttest cons=food
99. Statistical Tests
• On the Side – How to interpret the p-values
– Read the p-value for the results
– Convert it to a percentage (100*p)
– Now let X = 100*p
– Decision rule
• If $X \le 1$, reject Ho at the 1% level of significance
• If $1 < X \le 5$, reject Ho at the 5% level of significance
• If $5 < X \le 10$, reject Ho at the 10% level of significance
100. Statistical Tests
• ttest command for independent groups with pooled
(equal) variance
ttest cons, by(sexh)
• ttest command for independent groups using
unequal variance
ttest cons, by(sexh) unequal
• hotelling command performs Hotelling's T-squared
test of whether the means are equal between two
groups.
hotel cons, by(sexh)
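The t-test variants and the p-value decision rule can be sketched together. The example assumes the auto dataset shipped with Stata, with mpg and foreign standing in for the training variables; the 5% threshold in the last lines is only one of the levels listed in the decision rule.

```stata
sysuse auto, clear                       // example data (assumption)
ttest mpg == 20                          // one-sample test of the mean
ttest mpg, by(foreign)                   // two groups, pooled variance
ttest mpg, by(foreign) unequal           // two groups, unequal variances
local X = 100*r(p)                       // two-sided p-value as a percentage
display "X = " `X'
if `X' <= 5 {
    display "reject Ho at the 5% level of significance"
}
```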
101. Linear Regression
• Regression analysis involves estimating an
equation that best describes the data
• One variable is considered the dependent
variable, while the others are considered
independent (or explanatory) variables
• Stata is capable of many types of regression
analysis and associated statistical tests
• Here we touch on only a few of the more
common commands and procedures
102. Linear Regression
• regress
– This is an example of ordinary linear regression using the
regress command.
reg cons hhsize
– This regression tells us that for every extra person (hhsize)
added to a household, total monthly expenditure (cons) is
predicted to increase by about 40 Ethiopian Birr
– This increase is statistically significant, as indicated by the
0.000 probability (p-value) associated with this coefficient
103. Linear Regression
– The r-squared (r2) equals 0.0676. This value tells us
that our independent variable (hhsize) accounts for
approximately 7% of the variation in the dependent variable
(cons)
– Running a regression with robust standard errors gives
standard errors that remain valid when the
residuals are not iid, in particular when their variance is not constant
– This is very useful when there is heteroscedasticity.
– The robust option does not affect the estimates of the
regression coefficients
reg cons hhsize, robust
104. Linear Regression
– Stata stores results from estimation commands in e(), and
you can see a list of exactly what is stored using the
ereturn list command.
ereturn list
– Using the generate command, we can extract those results,
such as estimated coefficients and standard errors, to be
used in other Stata commands. Note that generate creates a
variable that holds the same value in every observation.
reg cons hhsize
gen intercept=_b[_cons]
display intercept
gen slope=_b[hhsize]
display slope
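As an alternative to generate, stored results can be kept in scalars, which hold a single number instead of filling a whole column. This is a sketch under the assumption that the bundled auto dataset replaces the survey data; the scalar names are illustrative.

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear
regress price weight

* Scalars hold one number each, unlike variables created by generate
scalar intercept = _b[_cons]
scalar slope     = _b[weight]
scalar r2        = e(r2)

display "intercept = " intercept
display "slope     = " slope
display "R-squared = " r2
display "N         = " e(N)
```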
105. Linear Regression
– The estimates table command displays a table
of coefficients and statistics for one or more stored
sets of estimation results in parallel columns
estimates store estimatename
estimates table, b se t p
– By default, the predict command computes the predicted
(fitted) value for each observation
predict pred
– With the residuals option, the predict command
calculates the residual instead.
predict e, residual
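The commands on this slide can be combined as follows. This is only a sketch: it assumes the bundled auto dataset, and the names m1, m2, yhat, uhat and check are made up for the illustration.

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear

regress price weight
estimates store m1

regress price weight mpg
estimates store m2

* Side-by-side table of both stored models
estimates table m1 m2, b se t p

* Fitted values and residuals from the last model (m2)
predict double yhat
predict double uhat, residuals

* Observed value = fitted value + residual
generate double check = yhat + uhat
summarize price check
```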
106. Linear Regression
– We can plot the predicted and observed values using the
graph twoway command.
regress cons food
predict pred
graph twoway (scatter cons food) (line pred food)
– The rvfplot command generates a plot of the residuals
versus the fitted values. It is used after the regress command.
regress cons food
rvfplot
– The rvpplot command produces a plot of the residuals
versus a specified predictor
rvpplot food
107. Linear Regression
• Hypothesis tests
– The test command performs Wald tests of simple
and composite linear hypotheses about the
parameters of the most recently fitted model
recode q1a 7/9=7
gen reg1=q1a==1
gen reg3=q1a==3
gen reg4=q1a==4
gen reg7=q1a==7
regress cons hhsize reg1 reg3 reg4 reg7
108. Linear Regression
test reg3=0
test reg3= reg4= reg7
– The first test command tests the hypothesis that the
coefficient on the region 3 dummy is zero (test reg3=0), and
the second tests that the coefficients on the region dummies
are equal to one another (reg3 = reg4 = reg7). The reported
probability is very low (displayed as 0.000), so we can
reject these hypotheses.
– If you want to test the joint significance of a set of
related variables, you can use testparm, which here tests
the hypothesis that all region dummies are zero
testparm reg*
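The difference between testing equality and testing joint significance shows up in the number of restrictions. The sketch below assumes the bundled auto dataset, with dummies built from rep78 standing in for the region dummies; the scalar names are illustrative.

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear

* Dummies rep1-rep5 for the five repair-record categories
tabulate rep78, generate(rep)

regress price weight rep2 rep3 rep4 rep5

* Equality of three coefficients: two restrictions
test rep3 = rep4 = rep5
scalar df_equal = r(df)

* All three coefficients equal to zero: three restrictions
testparm rep3 rep4 rep5
scalar df_zero = r(df)

display "restrictions, equality test:   " df_equal
display "restrictions, joint zero test: " df_zero
```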
109. Linear Regression
• Ramsey RESET to test for omitted variables
(misspecification)
ovtest [, rhs]
– This test amounts to estimating y = xb+zt+u and
then testing t=0
regress cons hhsize reg3 reg4 reg7
ovtest
– Without options, ovtest tests the significance of powers of
predicted cons
ovtest, rhs
– With the rhs option, it tests the significance of powers of
hhsize, reg3, reg4 and reg7
110. Linear Regression
• Example:
ovtest
Ramsey RESET test using powers of the fitted values of cons
Ho: model has no omitted variables
F(3, 1441) = 4.47
Prob > F = 0.0039
– The ovtest rejects the hypothesis that there are no
omitted variables, indicating that we need to
improve the specification
111. Linear Regression
• Heteroskedasticity
– We can use the hettest command to run an
auxiliary regression of the squared residuals on the fitted values.
hettest
Ho: Constant variance
Variables: fitted values of cons
chi2(1) = 81.50
Prob > chi2 = 0.0000
– The hettest result indicates that there is
heteroskedasticity, which needs to be dealt with
112. Linear Regression
• We can also use the information matrix test with the
imtest command, which provides a summary
test of violations of the assumptions on
regression errors.
imtest
• The imtest also confirms the presence of
heteroskedasticity, skewness and kurtosis
problems
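In recent Stata versions the same diagnostics are run through the estat postestimation command. A minimal sketch, assuming the bundled auto dataset in place of the survey data (the scalar bp_p is an illustrative name):

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear
regress price weight mpg

estat ovtest          // Ramsey RESET with powers of the fitted values
estat ovtest, rhs     // RESET with powers of the regressors

estat hettest         // Breusch-Pagan test on the fitted values
scalar bp_p = r(p)    // keep the p-value for later use

estat imtest          // information matrix test

display "Breusch-Pagan p-value = " bp_p
```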
113. Linear Regression
– The xi prefix is used to dummy code categorical
variables, and we tag these variables with an “i.”
in front of each target variable
xi: regress cons hhsize i.q1a, robust
– By default, Stata selects the first category of the
categorical variable as the reference category. If
we would like to declare a different category as the
reference category, we set the omit characteristic first
char q1a[omit] 7
xi: regress cons hhsize i.q1a, robust
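From Stata 11 onward, factor-variable notation gives the same dummy coding without the xi prefix, and the ib#. operator sets the reference category. This sketch assumes the bundled auto dataset, with rep78 standing in for q1a.

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear

* i. creates the indicators on the fly; the lowest category is the base
regress price weight i.rep78, vce(robust)

* ib5. makes category 5 the reference category instead
regress price weight ib5.rep78, vce(robust)
```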
114. Linear Regression
– Logistic regression
logistic poor hhsize ageh sexh, coef
xi:logit poor hhsize ageh sexh i.q1b
ereturn list
estat summarize
estat ic
mfx [, options]
– Options
dydx specifies that marginal effects be calculated in the form of d(y)/d(x); this is the default
eyex specifies that elasticities be calculated in the form of d(lny)/d(lnx)
dyex specifies that elasticities be calculated in the form of d(y)/d(lnx)
eydx specifies that elasticities be calculated in the form of d(lny)/d(x)
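In Stata 11 and later, the margins command covers what mfx does. The sketch below assumes the bundled auto dataset, with foreign as the binary outcome in place of poor; atmeans evaluates the effects at the means of the regressors.

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear

logit foreign weight mpg

* Marginal effects dy/dx evaluated at the means of the regressors
margins, dydx(*) atmeans

* Elasticities d(lny)/d(lnx) evaluated at the means
margins, eyex(*) atmeans
```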
115. Data Management
• We can subset data by keeping or dropping
variables, or by keeping and dropping
observations
– keep and drop variables
• The keep command is used to keep variables in the list
while dropping other variables
• The drop command is used to delete variables in the
list while keeping other variables
– keep and drop observations
• The keep if command keeps the observations that meet a
condition, while drop if deletes the observations that meet it
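A short sketch of both uses, assuming the bundled auto dataset in place of the survey data:

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear

keep make price mpg foreign     // keep four variables, drop the rest
drop mpg                        // delete one more variable

keep if foreign == 1            // keep only the foreign cars
drop if price > 10000           // delete the expensive cars
describe
```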
116. Data Management
• sort
– The sort command arranges the observations of the
current data into ascending order based on the values of
the variables listed
• Variable ordering
– The order command helps us to organize variables in a way
that makes sense by changing the order of the variables
• With the by prefix, _N is the total number of observations
within each group listed in the by command, and _n is
the running counter that uniquely identifies
observations within the group
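The three commands can be sketched together as follows, assuming the bundled auto dataset; rank_in_group and group_size are illustrative variable names.

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear

sort foreign price                        // ascending by foreign, then price
order make foreign price                  // move these variables to the front

by foreign: generate rank_in_group = _n   // 1, 2, ... within each group
by foreign: generate group_size    = _N   // number of cars in the group

list make foreign price rank_in_group group_size if rank_in_group == 1
```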
117. Data Management
• Often we don’t have all the information that we need
in one dataset, and we have to merge several datasets
into one (since Stata allows for only one
dataset in memory).
• There are several types of “merging”
datasets…
118. Data Management
• As long as the variables
in the files are the same
and the only thing you
need to do is to add
observations, this is
vertical combination.
• For this we use the
append command.
• Since this is used less
often, I will skip it, but
you can look at it in the
help file.
119. Data Management
• Appending data files
– The append command concatenates two datasets, that is, it
sticks them together vertically, one after another
use tigray.dta, clear
append using amhara.dta
– The append command does not require that the
two datasets contain the same variables. But it is
highly recommended to use an identical list of
variables for the append command to avoid missing
values from one dataset
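A self-contained sketch of appending, assuming the bundled auto dataset is split into two temporary files that play the role of the regional files; the generate() option adds a marker for the source file.

```stata
* Assumption: the bundled auto dataset stands in for the regional files
sysuse auto, clear

* Split the data into two temporary files
preserve
keep if foreign == 0
tempfile part_a
save `part_a'
restore

keep if foreign == 1
tempfile part_b
save `part_b'

* Stack the two files; source is 0 for the first file, 1 for the second
use `part_a', clear
append using `part_b', generate(source)
tabulate source
```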
120. Data Management
• If the identifying
variable which
appears in the files is
unique in both files,
then it's a one-to-one
match. Unique means
that for each value of
this variable, there is
only one observation
that contains it. In the
figure below, country
is the identifying
variable. In both
datasets, each country
has only one
observation.
121. Data Management
• One-to-one match merging
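• The merge command sticks two datasets together horizontally, one next
to the other. With the merge 1:1 syntax shown here, Stata sorts both
datasets by the merge variable itself; only the older merge syntax
(before Stata 11) required the datasets to be sorted beforehand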
use hh_characters.dta, clear
merge 1:1 hhid using consum.dta
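After a merge it is good practice to inspect the _merge variable that Stata creates. This sketch assumes the bundled auto dataset, split into two temporary files that share the unique key make (a stand-in for hhid).

```stata
* Assumption: the bundled auto dataset stands in for the household files
sysuse auto, clear

* Two files sharing the unique key make
preserve
keep make price
tempfile prices
save `prices'
restore

keep make mpg weight

* The key must identify observations uniquely
isid make

merge 1:1 make using `prices'
tabulate _merge

* Stop if any observation failed to match
assert _merge == 3
drop _merge
```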
123. Data Management
• Collapse
– Sometimes we have data files that need to be
aggregated at a higher level to be useful for us.
For example, we have household data but we
are really interested in regional data. The collapse
command serves this purpose by converting the
dataset in memory into a dataset of means, sums,
medians and percentiles
• For instance, we would like to see the mean cons in
each q1a and sex of hh head.
collapse (mean) cons, by(q1a sexh)
124. Data Management
• The reshape wide command tells Stata that
we want to go from long to wide format after
collapsing. The i() option specifies the variable that
identifies the rows, while j() specifies the variable
whose values become the columns
reshape wide cons, i(q1a) j(sexh)
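The collapse and reshape steps can be tried end to end with the sketch below. It assumes the bundled auto dataset, where rep78 plays the role of q1a and foreign the role of sexh.

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear
drop if missing(rep78)

* One row per rep78-foreign cell, holding the mean price
collapse (mean) price, by(rep78 foreign)

* One row per rep78, with columns price0 (domestic) and price1 (foreign)
reshape wide price, i(rep78) j(foreign)
list
```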
125. Importing Data
• The insheet command can import data in text format (Tab
delimited, or comma separated values CSV files).
• Syntax:
insheet [variable names] using <filename> [,options]
• Options:
– tab : tab-delimited data
– comma : comma-delimited data
– delimiter("char"): use char as delimiter
– clear: replace data in memory
– names : variable names are included on the first line of the file
• Example
cd "…Datafor stata training manual_EEA"
clear
insheet using ERHS_SPSS.csv, comma
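In Stata 13 and later, import delimited is the documented replacement for insheet. The sketch below is self-contained under two assumptions: it uses the bundled auto dataset, and it writes a hypothetical file auto_copy.csv to the working directory before reading it back.

```stata
* Assumption: the bundled auto dataset stands in for the survey data
sysuse auto, clear

* Write a small CSV to the working directory (hypothetical file name)
export delimited using auto_copy.csv, replace

* Read it back; varnames(1) takes the variable names from the first line
import delimited using auto_copy.csv, varnames(1) clear
describe
```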
126. Good Sites to Look At!
• STATA HELP – either online or in the software itself.
• http://stataproject.blogspot.com.
• http://www.stata.com/
• http://www.stata.com/statalist/
• http://ideas.repec.org/s/boc/bocode.html
• http://www.princeton.edu/~erp/stata/main.html
• http://www.cpc.unc.edu/services/computer/presentations/statatutorial/
• http://www.ats.ucla.edu/stat/stata/
127. Good Sites to Look At!
• Statalist is hosted at the Harvard School of Public
Health, and is an email listserver where Stata users,
from experts writing Stata programs to users like us,
maintain a lively dialogue about all things statistical
and Stata.
• You can sign on to Statalist so that you can receive
messages as well as post your own questions through email.