Introduction to the
Statistical Software Stata
(Release 12)
Outline
• The Stata Platform
• Storing Commands and Output
• Examining dataset
• Descriptive Statistics
• Creating and Modifying Variables
• Advanced Descriptive Statistics
• Presenting Data with Graph
• Normality and Outlier
• Statistical Tests
• Linear Regression
• Data Management
The Stata Platform
Results
Command
Variables
Variable
Properties
Review
Housekeeping Commands
• The Global macros
– Here we use it to store file locations (but it has many other
uses)
• We can define the path of our file using
global mydata "D:\...\Data"
• Whenever we need to refer to this path we can write
$mydata
Housekeeping Commands
• The cd (Change Directory) command
– On its own, identifies the current working
directory
– Followed by a path, changes the current working
directory to the one on the path
cd "D:\...\Data"
Or
cd "$mydata"
Storing Commands and Output
• The following topics are covered:
– Using the Do-file Editor
– log using
– log off
– log on
– log close
– set logtype to move tables from Stata to Word and
Excel
Storing Commands and Output
• Using the Do-file Editor
– The Do-file Editor allows you to store a program
(a set of commands),
– It makes checking and fixing errors easier,
– It allows you to run the commands later,
– It lets you share your procedures with
collaborators or reviewers, and
– It allows you to collaborate with others on the
analysis.
Storing Commands and Output
• Any time you are running more than 10
commands, it is easier and safer to use a Do-
file to store the commands
• To open the Do-file Editor, you can
– click on Windows/Do-file Editor or
– click on the icon on the Tool Bar.
Storing Commands and Output
• keyboard commands are quicker to use than
the buttons. The most useful ones are:
• Control-O Open file
• Control-S Save file
• Control-C Copy
• Control-X Cut
• Control-V Paste
• Control-Z Undo
• Control-F Find
• Control-H Find and Replace
Storing Commands and Output
• Adding comments to a do-file
– To add comment on a single line
* We can put an asterisk and write the command
– To add a comment in multiple lines
/* open a bracket like this
and end it by closing the bracket like this */
–To add a comment after a command
Command // write the comment after 2 slashes
Storing Commands and Output
• To run the commands in a Do-file,
– you can click on the Do button or
– click on Tools/Do or
– Use Ctrl+D
– If you want to run one or just a few commands
rather than the whole file, mark the commands and
click on the Do button
Storing Commands and Output
• Saving the Output
– Stata Results window does not keep all the output
you generate.
– when it is full, it begins to delete the old results as
you add new results.
– Thus, we need to use log to save the output
Storing Commands and Output
• log using
– This command creates a file with a copy of all the
commands and output from Stata. The syntax is:
log using filename [, append replace [ text | smcl ] ]
• append adds the output to an existing file
• replace replaces an existing file with the output
• text tells Stata to create the log file in text
(ASCII) format
• smcl tells Stata to create the log file in
SMCL format
Storing Commands and Output
• Here are some examples:
– log using temp22                     saves output to a file called temp22
– log using temp22, replace            saves output to an existing file, temp22, replacing its contents
– log using temp22, append             saves output to an existing file, temp22, adding to its contents
– log using "$mydata\myfile", replace  saves output to the specified file in the specified folder
Storing Commands and Output
• log off
– This command temporarily turns off the logging of
output,
• log on
– This command is used to restart the logging,
• log close
– This command is used to turn off the logging and
save the file.
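A minimal sketch of a typical logging sequence (the log file name mylog is just an illustration):
log using mylog, replace
summarize hhsize cons
log off
* commands run here are not recorded in the log
log on
describe
log close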
Storing Commands and Output
• set logtype text
– This command tells Stata to always save the log
files in text (ASCII) format
• set logtype smcl
– This command tells Stata to always save log files in
SMCL format.
Examining dataset
• clear
– The clear command deletes all data, variables, and
labels from memory to get ready to use a new
data file
– You can clear memory using the clear command or
by using it as part of the use command
– This command does not delete any data saved to
the hard-drive
Examining dataset
• set memory
– First you can check to see how much memory is
allocated to hold your data using the memory
command
– By default we have 11MB free for reading in a data
file.
– Whenever we try to read a data file bigger than this
free memory, we will get the following error message:
no room to add more observations
r(901);
Examining dataset
– In this case we have to allocate more memory, say
25MB (if 25MB is sufficient for the current file), with the
set memory command before trying to use our file.
set memory 25m
– Now that we have allocated enough memory, we will
be able to read bigger files, provided they fit within
the allocated memory
– If we want to allocate 25m (25 megabytes) every time
we start Stata, we can type:
set memory 25m, permanently
Examining dataset
• use
– This command opens an existing Stata data file.
• The syntax is:
use filename [, clear ] opens the file ‘filename’
use [varlist] [if exp] [in range] using filename [, clear ]
opens selected parts of file
– If there is no path, Stata assumes it is in the current folder.
– You can use a path name such as: use C:\...\ERHScons1999
– If the path name has spaces, you must use double quotes:
use "d:\my data\ERHScons1999"
• Logical operators used in Stata
~ Not
== Equal
~= not equal
!= not equal
> greater than
>= greater than or equal
< less than
<= less than or equal
& And
| Or
Examining dataset
Examining dataset
Here are some examples on the use command:
• use ERHScons1999 opens the file ERHScons1999.dta for
analysis.
• use ERHScons1999 if q1a == 1 opens data from region 1
• use ERHScons1999 in 5/25 opens records 5 through 25 of file
• use hhsize cons using ERHScons1999 opens 3 variables from
ERHScons1999 file
• use C:\training\ERHScons1999 opens the file ERHScons1999.dta in the
specified folder
• use "$mydata\ERHScons1999" use quotation marks if there are
spaces
• use ERHScons1999, clear clears memory before opening the new
file
Examining dataset
• save
– The save command will save the dataset as a .dta file under the
name you choose.
Open a subset of a dataset (for region 1 = Tigray only)
use erhscons1999 if q1a==1, clear
Save this data as a new file with the name tigray
save tigray, replace
• The replace option allows you to save a changed file to the
disk, replacing the original file. Stata is worried that you will
accidentally overwrite your data file. You need to use the
replace option to tell Stata that you know that the file
exists and you want to replace it.
Examining dataset
• Open the training dataset
use ERHScons1999, clear
• edit
– This command is used to open the data editor window
that allow us to view observations as a spreadsheet
– You can change the data using data editor window but
it is not recommend to edit data using this window
– It is better to correct errors in the data using a Do-file
program that can be saved
• browse
– This window is exactly like the data editor window,
except that you can’t change the data in this case
• describe
– This command provides a brief description of the data
file. You can use “des” or “d” as shorthand for
describe.
– The output includes:
• the number of variables
• the number of observations (records)
• the size of the file
• the list of variables and their characteristics
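For example (assuming the training dataset is in memory):
describe                // describe all variables in the dataset
describe hhsize cons    // describe only the listed variables
d, short                // "d" is shorthand; short shows only the header information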
Examining dataset
Examining dataset
• list
– This command lists values of variables in data set.
The syntax is:
list [varlist] [if exp] [in range]
• examples:
– list lists entire dataset
– list in 1/10 lists observations 1 through 10
– list hhsize q1a food lists selected variables
– list hhsize sex in 1/20 lists observations 1-20 for selected
variables
– list if q1a < 6 lists cases where region is 1 through 5
Examining dataset
• if
– This command is used to select certain records in
carrying out a command
command if exp
Examples:
– list hhid q1a food if food >1200 lists data if food is above 1200
– tab q1a if cons>1000 & cons<2000 frequency table of region if
consumption is in range
– summarize food if q1a==3 | q1a==4 statistics on food Consumption
for regions 3 and 4
– browse hhid q1a food if food >=1200 browse data if food
consumption is above 1200
• Note that “if” statements always use ==, not a single =
Examining dataset
• in
– We have also used in to select records based on
the case number.
– The syntax is:
command in exp
For example:
• list in 10 list observation number 10
• summarize in 10/20 summarize observations
10-20
• l in -10/-1 list the last 10 observations
Examining dataset
• codebook
– The codebook command is a great tool for getting
a quick overview of the variables in the data file.
– It produces a kind of electronic codebook from
the data file, displaying information about
variables' names, labels and values
. codebook
sexh Sex of household head
----------------------------------------------------------------------------
type: numeric (byte)
label: sexhh
range: [0,1] units: 1
unique values: 2 missing .: 0/1452
tabulation: Freq. Numeric Label
400 0 Female
1052 1 Male
Examining dataset
• inspect
– It is another useful command for getting a quick
overview of a data file.
– The inspect command displays information about the
values of variables and is useful for checking data
accuracy
. inspect sexh
sexh: Sex of household head           Number of Observations
------------------------------                          Non-
                                   Total  Integers  Integers
|  #              Negative             -         -         -
|  #              Zero               400       400         -
|  #              Positive          1052      1052         -
|  #                               -----     -----     -----
|  #    #         Total             1452      1452         -
|  #    #         Missing              -
+----------------------            -----
 0              1                   1452
(2 unique values)
sexh is labeled and all values are documented in the label.
Examining dataset
• count
– The count command can be used to show the number
of observations that satisfy an if condition. If no
condition is specified, count displays the
number of observations in the data.
count
1452
count if q1a==3
466
Examining dataset
Common Stata Syntax
• Stata commands follow the syntax:
[by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]
• Items inside the square brackets are optional and are not available for every command.
• This syntax applies to most Stata commands
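A sketch of a command that combines several of these elements (variable names taken from the training data used in these slides):
bysort sexh: summarize cons food if q1a==3, detail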
Descriptive Statistics
• tabulate, tab1, tab2
–These are three related commands that
produce frequency tables for discrete
variables.
–They can produce one-way or two-way
frequency tables
Descriptive Statistics
• tabulate or tab produce a frequency table
for one or two variables
• tab1 produces a one-way
frequency table for each
variable in the variable list
• tab2 produces all possible two-
variable tables from the
list of variables
Descriptive Statistics
You can use several options with these commands:
• all gives all the tests of association for two-way
tables
• cell gives the overall percentage for two-way
tables
• column gives column percentages for two-way
tables
• row gives row percentages for two-way tables
• nofreq suppresses printing the frequencies.
• chi2 provides the chi squared test for two-way
tables
Descriptive Statistics
Some examples of the tabulate commands are:
• tabulate q1a produces table of frequency by region
• tabulate q1a sexh produces a cross-tab of
frequencies by region and sex of head
• tabulate q1a hhsize, row produces a cross-tab by
region and hhsize with row
percentages
• tabulate sexh hhsize, cell nofreq produces a cross-tab of overall
percent by sex and hhsize.
• tab1 q1a q1b hhsize produces three tables, a
frequency table for each
variable
• tab2 q1a poor sexh produces three tables, a cross-
tab of each pair of variables
Descriptive Statistics
• summarize
– The summarize command produces statistics on continuous variables like age,
food, cons, and hhsize. The syntax looks like this:
summarize [varlist] [if exp] [in range] [, [detail]]
By default, it produces the following statistics:
• Number of observations
• Average (or mean)
• Standard deviation
• Minimum
• Maximum
If you specify “detail” Stata gives you additional statistics, such as
• skewness,
• kurtosis,
• the four smallest values
• the four largest values
• various percentiles.
Descriptive Statistics
• Here are some examples:
• summarize                          gives statistics on all variables
• summarize hhsize food              gives statistics on selected variables
• summarize hhsize cons if q1a==3    gives statistics on two variables for one region
Descriptive Statistics
• bysort
– This prefix goes before a command and asks Stata
to repeat the command for each value of a variable.
The general syntax is:
bysort varlist: command
• Example:
– bysort sex: sum rconsae for each sex of hh head, gives stats
on real consumption per adult equivalent
Descriptive Statistics
• help
– The help command gives you information about any
Stata command or topic
help [command]
For example,
• help tabulate gives a description of
the tabulate command
• help summarize gives a description of the
summarize command
Creating New Variables
• We have seen how to explore the data using
existing variables so far.
• Now we will discuss how to create new
variables.
• When new variables are created, they are in
memory and they will appear in the Data
Browser,
– but they will not be saved on the hard-disk unless
you use the save command
Creating New Variables
• generate
– This command is used to create a new variable. It
is similar to “compute” in SPSS.
• The syntax is;
generate newvar = exp [if exp]
where “exp“ is an expression like
“food/hhsize” or
“20*cons”
Creating New Variables
• The command cannot be used to modify an
existing variable
• You can use “gen“ or “g” as an abbreviation
for “generate“
• If the expression is an equality or inequality,
the variable will take the values 0 if the
expression is false and 1 if it is true
• If you use “if“, the new variable will have
missing values when the “if“ statement is false
Creating New Variables
• For example,
– gen age2 = age*age
• create age squared variable
– gen yield = outputkg/area if area>0
• create new yield variable if area is positive
– gen price = value/quant if quant>0
• create new price variable if quant is positive
– gen smhh= (hhsize<4)
• creates a dummy variable equal to 1 for smaller
households (less than 4 members)
• replace
– This command is used to change the definition of
an existing variable.
• The syntax is the same:
replace oldvar = exp [if exp] [in exp]
Creating New Variables
Creating New Variables
• For example,
replace cons=. if cons<0
replaces negative consumption with missing
value
replace price = avgprice if price > 100000
replaces high values with an average price
replace age = 25 in 1007
replace age=25 in observation #1007
Creating New Variables
• tabulate … generate
– This command is useful for creating a set of
dummy variables (variables with a value of 0 or 1)
depending on the value of an existing categorical
variable.
• The syntax is:
tabulate oldvariable, generate(newvariable)
abs(x) computes the absolute value of x
exp(x) calculates e to the x power.
ln(x) computes the natural logarithm of x
log(x) is a synonym for ln(x), the natural logarithm.
log10(x) computes the log base 10 of x.
sqrt(x) computes the square root of x.
invnorm(p) provides the inverse cumulative normal; invnorm(norm(z)) = z.
normden(z) provides the standard normal density.
normden(z,s) provides the normal density. normden(z,s) = normden(z)/s if s>0 and s not
missing, otherwise, the result is missing.
norm(z) provides the cumulative standard normal.
group(x) creates a categorical variable that divides the data into x as nearly equal-
sized subsamples as possible, numbering the first group 1, the second
group 2, etc. It uses the current order of the data.
int(x) gives the integer obtained by truncating x.
round(x,y) gives x rounded into units of y.
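A few illustrative uses of these functions (the new variable names lnfood, absdiff and roundcons are just examples):
gen lnfood = ln(food)             // natural log of food consumption
gen absdiff = abs(cons - food)    // absolute difference between two variables
gen roundcons = round(cons, 100)  // cons rounded to the nearest 100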
Creating New Variables
tab q1a, gen(region)
• This creates 6 new variables:
region1=1 if q1a=1 and 0 otherwise
region2 =1 if q1a =3 and 0 otherwise
……
region6=1 if q1a =8 and 0 otherwise
Creating New Variables
• egen
– This is an extended version of generate (egen stands
for extended generate), used to create a new variable by
aggregating the existing data.
• The syntax is:
egen newvar = fcn(arguments) [if exp] [in range] , by(var)
Creating New Variables
• count() number of non-missing
values
• diff() compares variables, 1 if
different, 0 otherwise
• fill() fill with a pattern
• group() creates a group id
from a list of variables
• iqr() interquartile range
• ma() moving average
• max() maximum value
• mean() mean
• median() median
• min() minimum value
• pctile() percentile
• rank () rank
• rmean() mean across
variables
• sd () standard deviation
• std() standardize
variables
• sum () sums
Creating New Variables
• egen avg = mean(cons)
creates variable of average consumption
over entire sample
• egen avg2 = median(cons), by(sex)
creates variable of median consumption
for each sex
• egen regprod = sum(cons), by(q1a)
creates variable of total consumption for
each region
Creating New Variables
• Exercise,
• we want to know which households have
expenditure (cons) above the village average.
• I.e. Create a dummy (1 for those who
consume above the village/peasant
association average and 0 otherwise)
Creating New Variables
• egen avecon=mean(cons), by( q1c)
• gen highavecon=(cons> avecon & cons!=.)
• list hhid q1c cons avecon highavecon in 650/675
Creating New Variables
• Arithmetic
+ addition
- subtraction
* multiplication
/ division
^ power
• Logical
~ not
| or
& and
• Relational
> greater than
< less than
>= more than or equal
<= less than or equal
== equal
~= not equal
!= not equal
Creating New Variables
• Here are some examples to illustrate the use of these
operators. Suppose you want to create a
– dummy variable indicating households in the
Amhara region.
– One way to do it is to run:
generate AmD = 0
replace AmD = 1 if q1a==3
– Or you can get exactly the same result with just:
generate AmD2 = (q1a==3)
compare AmD AmD2
Creating New Variables
• For example, generate a dummy that would
identify observations with female household
heads in Dodota wereda.
gen DDfemale = 0
replace DDfemale = 1 if q1b==9 & sexh==0
or an easier way to do this would be:
gen DDfemale2 = (q1b==9 & sexh==0)
Creating New Variables
• recode
– This command changes the values of a categorical
variable according to the rules specified.
• The syntax is:
recode varname old#=new# old#=new# [if exp] [in range]
Creating New Variables
• Notice that you can use some special symbols
in the rules:
* means all other values
. means missing values
x/y means all values from x to y
x y means x and y
• For example, recode region values 8 and 9 to 7 (see the sketch below)
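A sketch of that example, assuming q1a is the region variable as elsewhere in these slides:
recode q1a 8 9 = 7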
Creating New Variables
• Here are some examples:
• recode x 1=2 changes all values of x=1 to x= 2
• recode x 1=2 3=4 changes 1 to 2 and 3 to 4
• recode x 1=2 2=1 exchanges the values 1 and 2 in x
• recode x 1=2 *=3 changes 1 in x to 2 and all other
values to 3
• recode x 1/5=2 changes 1 through 5 in x to 2
• recode x 1 3 4 5 = 6 changes 1, 3, 4 and 5 to 6
• recode x .=9 changes missing to 9
• recode x 9=. changes 9 to missing
Creating New Variables
• xtile
– This command creates a new variable that
indicates which category a record falls into, when
the sample is sorted by an existing variable and
divided into “n” groups of equal size.
• Example: xtile can be used to create a variable
that indicates which income quintile a
household belongs to
Creating New Variables
• The syntax is:
xtile newvar = variable [if exp] [in range] , nq(#)
– where newvar is the new categorical variable
created; variable is the existing variable used to
create the quantile (e.g income, farm size); # is the
number of different categories (eg 5 for quintiles,
3 for terciles)
Creating New Variables
• For example,
xtile consq = cons, nq(5)
xtile rconsq = rconsae, nq(10)
Modifying Variables
• We begin with an explanation of how to label
data in Stata. Then see how to format
variables.
– rename variable
– label variable
– label define
– label values
– format variable
Modifying Variables
• rename variables
– This command is used to rename a variable, that
is, to give it a different name.
– The syntax is
rename old_variable new_variable
• Example: Generate a dummy for the region
variable and rename the new dummy
variables accordingly
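A sketch of that exercise (the new names are only illustrative):
tab q1a, gen(region)
rename region1 tigray
rename region2 amhara
rename region3 oromia
rename region4 snnp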
Modifying Variables
[Diagram: several variables (current residence, place of birth, migrate to) all take region codes and can share one set of value labels. label define creates and saves the label definitions; label values attaches the defined labels to a variable.]
Modifying Variables
• label define
– This command gives a name to a set of value
labels. For example, instead of numbering the
regions, we can assign a label to each region.
• The syntax is:
label define lblname # "label" # "label" # “label“
[, add modify]
Modifying Variables
• label values
– This command attaches named set of value labels
to a categorical variable.
• The syntax is:
label values varname [lblname] [, nofix]
Modifying Variables
label define reg 1"Tigray" 3"Amhara" 4"Oromia"
7"SNNP",modify
label values q1a reg
• Some additional commands that may be
useful in labeling
– label dir to request a list of existing label names
– label list to request a list of all the existing value
labels
– label drop to delete one or more labels
– label save using to save label definitions as a Do-file
– label data to give a label to a data file
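A brief sketch of a few of these commands (reg is the label name defined on the next slide; the data label text and file name are illustrative):
label dir                                      // list existing label names
label list reg                                 // show the values and labels in reg
label data "ERHS 1999 consumption data"        // attach a label to the data file
label save reg using regionlabels.do, replace  // save the label definition as a Do-file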
• tabulate … summarize
– This command creates one- and two-way tables
that summarize continuous variables. With the
“summarize” option, we can display means and other
statistics of a continuous variable in each cell.
• The syntax is:
tabulate varname1 varname2 [if exp] [in range],
summarize(varname3) options
• You can specify which statistics with options “means”,
“standard” and “freq”
Advanced Descriptive Statistics
• Some examples:
• tab q1a, sum(cons)         gives the mean, std deviation, and frequency of per capita expenditure for each region
• tab q1b, sum(cons) means   gives only the mean consumption for each village
• tab q1a sexh, sum(food)    gives the mean, std deviation, and frequency of food consumption in each cell of region by sex of head
Advanced Descriptive Statistics
Advanced Descriptive Statistics
• tabstat
– This command gives summary statistics for a set of
continuous variable for each value of a categorical
variable.
• The syntax is:
tabstat varlist [if exp] [in range] , stat(statname [...]) by(varname)
• where
– varlist is a list of continuous variables
– statname is a type of statistic
– varname is a categorical variable
Some facts about this command:
– The default statistic is the mean.
– Optional statistics subcommands include mean, sum, max,
min, range, sd (standard deviation), var (variance),
skewness, kurtosis, median, and pn (nth percentile).
– Without the by() option, tabstat is like “summarize” except
that it allows you to specify the list of statistics to be
displayed.
– With the by() option, tabstat is like "tabulate … summarize"
except that tabstat is more flexible in the statistics and
formats it can display
Advanced Descriptive Statistics
• Examples
– tabstat food hhsize, stats(mean max min) gives mean,
max, and min of food & hhsize
– tabstat food hhsize, by(q1a) gives mean of two
variables for each region
– tabstat food, stats(median) by(q1a) gives the median
food consumption
for each region
• The tabstat command displays summary statistics for
a series of numeric variables in a single table.
Advanced Descriptive Statistics
• table
– This command creates a wide variety of tables. It is
probably the most flexible and useful of all the
table commands in Stata.
• The syntax is:
table rowvar colvar [if exp] [in range], c(clist) [row col]
• where
– rowvar is the categorical row variable
– colvar is the categorical column variable
– clist is a list of statistic and variables
– row is an option to include a summary row
– col is an option to include a summary column
Advanced Descriptive Statistics
• Some useful facts about this command:
– The default statistic is the frequency.
– Optional statistics are mean, sd, sum, rawsum
(unweighted), count, max, min, median, and pn
(nth percentile).
– The c( ) is short for contents of each cell.
– Like tab, it can be used to create one- and two-
way frequency tables, but table cannot do
percentages
Advanced Descriptive Statistics
Advanced Descriptive Statistics
• Useful facts (cont.) :
– Like tab…sum, it can be used to calculate basic stats for
each value of a categorical variable
– Its advantage over tab…sum is that it can do more
statistics and it can take more than one continuous
variable
– Like tabstat, it can be used to calculate advanced stats for
each value of a categorical variable
– Its advantage over tabstat is that it can use two (and
more) way tables, but its disadvantage is that it has fewer
statistics.
• Here are some examples:
– table q1a, row                               table of frequencies by region with a total row
– table q1a, c(mean cons)                      table of average consumption by region
– table q1a, c(mean food sd food median food)  table of food consumption statistics by region
– table q1a, c(mean cons) format(%9.2f)        table of average consumption by region, formatted
– table q1a sexh, c(mean cons)                 table of average consumption by region and sex
– table q1a sexh, c(mean cons mean food)       table of average consumption and food consumption by region and sex
Advanced Descriptive Statistics
Presenting Data with Graph
• The commands that draw graphs are
graph twoway scatterplots, line plots,
graph matrix scatterplot matrices
graph bar bar charts
graph dot dot charts
graph box box-and-whisker plots
graph pie pie charts
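For instance, a hedged example of a bar chart of mean consumption by region, using variables from the training data:
graph bar (mean) cons, over(q1a)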
Presenting Data with Graph
• Examples
graph twoway scatter cons food
• We can show the regression line predicting
cons from food using the lfit plot type.
twoway lfit cons food
• The two plots can be overlaid like this
twoway (scatter cons hhsize) (lfit cons hhsize)
twoway (scatter cons food) (lfit cons food)
Presenting Data with Graph
• Labeling graphs
scatter var1 var2, title("title") subtitle("subtitle")
xtitle("xtitle") ytitle("ytitle") note("note")
• Example
scatter ageh cons , title("title") subtitle("subtitle")
xtitle("xtitle") ytitle("ytitle") note("note")
Normality and Outlier
• skewness and kurtosis
sum rconspc
sum rconspc, detail
• check normality of a variable visually by
looking at some basic graphs
histogram rconspc
histogram rconspc, normal
histogram rconspc, normal bin(100)
Normality and Outlier
Normality and Outlier
• graph box draws vertical box plots
graph box rconspc, by(sexh)
– y axis is numerical, and the x axis is categorical
– upper and lower bounds of box are defined by the
25th and 75th percentiles
– line within the box is the median
– ends of the whiskers are the upper and lower adjacent values (the most extreme observations within 1.5 times the interquartile range beyond the box)
• If rconspc is normal, the median would be in the
center of the box and the end of whiskers would be
equidistant from the box
Normality and Outlier
• The kdensity command with the normal option
kdensity rconspc, normal
– density graph of the residual with a normal distribution
superimposed on the graph
– useful in verifying that the residuals are normally
distributed
• pnorm command produces a P-P plot
pnorm rconspc
– It should be approximately linear if the variable follows
normal distribution
Normality and Outlier
• Qnorm command plots the quantiles of a variable
against the quantiles of a normal distribution
qnorm rconspc
– If the Q-Q plot shows a line that is close to the 45 degree
line, the variable is more normally distributed
• Both the P-P and Q-Q plots show that rconspc is not normal, with
a long tail to the right
• The qnorm plot is more sensitive to deviances from normality
in the tails of the distribution
• The pnorm plot is more sensitive to deviances near the mean
of the distribution
Normality and Outlier
• Dealing with outliers
– We have the following options when we have
outliers
• delete them from analyses
• use measures that are not sensitive to them, such as
median instead of mean
• transform the data to be more normal
• to replace them by imputation
Normality and Outlier
/* Calculate number of standard deviations from median by sex of hh head */
egen median=median(rconspc), by (sexh)
egen sd=sd(rconspc), by (sexh)
*generate the ratio of the deviation from the median to the standard deviation
gen ratio=abs((rconspc-median)/sd)
Normality and Outlier
*generate an outlier dummy equal to 1 if the ratio defined above is greater than 3
gen outlier=1 if ratio>3 & ratio~=.
replace outlier=0 if outlier==. & ratio~=.
tabulate outlier, missing
table sexh outlier, contents(mean rconspc) row col missing
Normality and Outlier
• Listwise deletion
histogram rconspc if outlier==0, normal
• Data transformation
– a log transformation
gen lnrconspc=ln(rconspc)
histogram lnrconspc if rconspc~=., normal
• Imputation
– First the analyst estimates a regression model in which the
dependent variable has missing values
– In the second step, the estimated regression coefficients
are used to predict (impute) missing values of that variable
Normality and Outlier
* Replace outliers to missing
replace rconspc=. if outlier==1
regress lnrconspc i.q1a i.sexh i.poor hhsize ageh, robust
predict yhat
replace lnrconspc=yhat if rconspc==.
• Or
xi: impute lnrconspc i.q1a i.sexh i.poor hhsize ageh, gen(imputed)
Statistical Tests
• compare
– The compare command is an easy way to check if
two variables are the same
compare lnrconspc imputed
• correlate command
– The correlate command displays a matrix of
Pearson correlations for the variable listed.
correlate cons hhsize
correlate cons hhsize, means
pwcorr cons hhsize, sig
Statistical Tests
• ttest command
– If, for example, we would like to test whether the mean of
hhsize equals 6 using a single-sample t-test, the ttest
command is used for this purpose.
ttest hhsize=6
• We can also test if cons and food have the same
mean
ttest cons=food
Statistical Tests
• On the Side – How to interpret the P-values
– Read the p-value for the results
– Convert it to percentage (100*p)
– Now let X=(100*p)
– Decision rule
• If X <= 1, reject Ho at the 1% level of significance
• If 1 < X <= 5, reject Ho at the 5% level of significance
• If 5 < X <= 10, reject Ho at the 10% level of significance
– For example, a p-value of 0.032 gives X = 3.2, so we reject Ho at the 5% level but not at the 1% level.
Statistical Tests
• ttest command for independent groups with pooled
(equal) variance
ttest cons, by(sexh)
• ttest command for independent groups using
unequal variance
ttest cons, by(sexh) unequal
• hotelling command performs Hotelling's T-squared
test of whether the means are equal between two
groups.
hotel cons, by(sexh)
Linear Regression
• Regression analysis involves estimating an
equation that best describes the data
• One variable is considered the dependent
variable, while the others are considered
independent (or explanatory) variables
• Stata is capable of many types of regression
analysis and associated statistical tests
• Here we touch on only a few of the more
common commands and procedures
• regress
– This is an example of ordinary linear regression by using
regress command.
reg cons hhsize
– This regression tells us that for every extra person (hhsize)
added to a household, total monthly expenditure (cons) will
increase by about 40 Ethiopian Birr
– This increase is statistically significant as indicated by the
0.000 probability associated with this coefficient
Linear Regression
– The r-squared (R2) equals 0.0676. This value tells us
that our independent variable (hhsize) accounts for
approximately 7% of the variation of the dependent variable
(cons)
– Running a regression with robust standard errors gives
valid standard errors when the residuals are not i.i.d.
– This is very useful when there is heteroskedasticity.
– The robust option does not affect the estimates of the
regression coefficients
reg cons hhsize, robust
Linear Regression
– Stata stores results from estimation commands in e(), and
you can see a list of what exactly is stored using the
ereturn list command.
ereturn list
– Using the generate command, we can extract those results,
such as estimated coefficients and standard errors, to be
used in other Stata commands.
• reg cons hhsize
• gen intercept=_b[_cons]
• display intercept
• gen slope=_b[hhsize]
• display slope
Linear Regression
– The estimates table command displays a table
with coefficients and statistics for one or more
estimation sets in parallel columns
estimates store estimatename
estimates table, b se t p
– The predict command computes predicted value
and residual for each observation
predict pred
– When using the resid option the predict command
calculates the residual.
predict e, residual
Linear Regression
– We can plot the predicted value and observed value using
graph twoway command.
regress cons food
predict pred
graph twoway (scatter cons food) (line pred food)
– The rvfplot command generates a plot of the residual
versus the fitted values. It is used after regress command.
regress cons food
rvfplot
– The rvpplot command produces a plot of the residual
versus a specified predictor
rvpplot food
Linear Regression
• Hypothesis tests
– The test command performs Wald tests for simple
and composite linear hypotheses about the
parameters of estimation
recode q1a 7/9=7
gen reg1=q1a==1
gen reg3=q1a==3
gen reg4=q1a==4
gen reg7=q1a==7
regress cons hhsize reg1 reg3 reg4 reg7
Linear Regression
test reg3=0
test reg3= reg4= reg7
– The first test command tests the hypothesis that the
region 3 coefficient is zero (test reg3=0); the second
tests that the region coefficients are all equal
(test reg3=reg4=reg7). In both cases the probability is very
low (p < 0.001), so we can reject the hypothesis.
– If you want to test the joint significance of a set of
related variable, you can use
testparm reg* test of hypothesis that all
region dummies are zero
Linear Regression
• Ramsey RESET to test for omitted variables
(misspecification)
ovtest [, rhs]
– This test amounts to estimating y = xb+zt+u and
then testing t=0
regress cons hhsize reg3 reg4 reg7
ovtest tests significance of powers of
predicted cons
ovtest, rhs tests significance of powers of
hhsize, reg3, reg4 and reg7
Linear Regression
• Example;
ovtest
Ramsey RESET test using powers of the fitted values of cons
Ho: model has no omitted variables
F(3, 1441) = 4.47
Prob > F = 0.0039
– The ovtest rejects the hypothesis that there are no
omitted variables, indicating that we need to
improve the specification
Linear Regression
• Heteroskedasticity
– We can use the hettest command to run an
auxiliary regression of the squared residuals on the fitted values.
hettest
Ho: Constant variance
Variables: fitted values of cons
chi2(1) = 81.50
Prob > chi2 = 0.0000
– The hettest indicates that there is
heteroskedasticity, which needs to be dealt with
Linear Regression
• We can also use information matrix test by
imtest command, which provides a summary
test of violations of the assumptions on
regression errors.
imtest
• The imtest also confirms the existence of
heteroskedasticity, skewness, and kurtosis
problems
Linear Regression
– The xi prefix is used to dummy code categorical
variables, and we tag these variables with an “i.”
in front of each target variable
xi: regress cons hhsize i.q1a, robust
– By default, Stata selects the first category in the
categorical variable as the reference category. If
we would like to declare a certain category as
reference category
char q1a[omit] 7
xi:regress cons hhsize i.q1a, robust
Linear Regression
– Logistic regression
logistic poor hhsize ageh sexh, coef
xi:logit poor hhsize ageh sexh i.q1b
ereturn list
estat summarize
estat ic
mfx [, options]
– Options
dydx is the default.
eyex specifies that elasticities be calculated in the form of d(lny)/d(lnx)
dyex specifies that elasticities be calculated in the form of d(y)/d(lnx)
eydx specifies that elasticities be calculated in the form of d(lny)/d(x)
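A hedged sketch of how mfx might be used after a logit (mfx is the Stata 12-era command; later releases favour margins):
logit poor hhsize ageh sexh
mfx            // marginal effects dy/dx evaluated at the means (the default)
mfx, eyex      // elasticities in the form d(lny)/d(lnx)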
Linear Regression
Data Management
• We can subset data by keeping or dropping
variables, or by keeping and dropping
observations
– keep and drop variables
• The keep command is used to keep variables in the list
while dropping other variables
• The drop command is used to delete variables in the
list while keeping other variables
– keep and drop observations
• The keep if command is used to keep observations if
condition is met and vice versa for drop
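A sketch using variable names from the training data (ageh is assumed to exist, as in the regression examples later):
drop ageh                        // delete one variable
keep hhid q1a sexh hhsize cons   // keep only the listed variables
keep if q1a==1                   // keep only observations from region 1
drop if cons==.                  // drop observations with missing consumption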
Data Management
• sort
– The sort command arranges the observations of the
current data into ascending order based on the values of
the variables listed
• Variable ordering
– The order command helps us to organize variables in a way
that makes sense by changing the order of the variables
• Under the by prefix, _N is the total number of observations
within each group listed in by, and _n is a running
counter that uniquely identifies observations within
the group
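A hedged sketch with the training variables:
sort q1a hhid                      // sort by region, then household id
order hhid q1a sexh hhsize cons    // put the identifiers first
bysort q1a: gen nobs_region = _N   // number of observations in each region
bysort q1a: gen obs_id = _n        // running counter within each region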
Data Management
• Often we don’t have all the info that we need
in one dataset, and we have to merge them
into one (since Stata allows for only one
dataset in memory).
• There are several types of “merging”
datasets…
Data Management
• As long as the variables
in the files are the same
and the only thing you
need to do is to add
observations, this is
vertical combination.
• For this we use the
append command.
• Since this is used less
often, I will skip it, but
you can look at it in the
help file.
Data Management
• Appending data files
– concatenates two datasets, that is, stick them
together vertically, one after another
use tigray.dta, clear
append using amhara.dta
– The append command does not require that the
two datasets contain the same variables. But it is
highly recommended to use an identical list of
variables with the append command to avoid missing
values coming from one dataset
Data Management
• If the identifying
variable which
appears in the files is
unique in both files,
then it's a one-to-one
match. Unique means
that for each value of
this variable, there is
only one observation
that contains it. In the
figure below, country
is the identifying
variable. In both
datasets, each country
has only one
observation.
Data Management
• One-to-one match merging
• The merge command sticks two datasets together horizontally, one next to
the other. Both datasets must contain the same identifying merge variable
(with the old merge syntax they also had to be sorted by it)
use hh_characters.dta, clear
merge 1:1 hhid using consum.dta
Data Management
• One-to-many
matching
– If the identifying
variable is
unique in one
file, but not
unique in the
other, then it's a
one-to-many
matching.
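A sketch of a one-to-many (here written as many-to-one) merge; the individual-level file individuals.dta is hypothetical, while hh_characters.dta is the household file used above:
use individuals.dta, clear
merge m:1 hhid using hh_characters.dta   // hhid is unique in hh_characters.dta only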
Data Management
• Collapse
– Sometimes we have data files that need to be
aggregated at a higher level to be useful for us.
For example, we have household data but we
are really interested in regional data. The collapse
command serves this purpose by converting the
dataset in memory into a dataset of means, sums,
medians and percentiles
• For instance, we would like to see the mean cons in
each q1a and sex of hh head.
collapse (mean) cons, by(q1a sex)
Data Management
• The reshape wide command tells Stata that
we want to go from long to wide format after
collapsing. The i() option specifies the row
identifier, while j() specifies the column variable
reshape wide cons, i(q1a) j(sexh)
Importing Data
• The insheet command can import data in text format (Tab
delimited, or comma separated values CSV files).
• Syntax:
insheet [variable names] using <filename> [,options]
• Options:
– tab : tab-delimited data
– comma : comma-delimited data
– delimiter("char"): use char as delimiter
– clear: replace data in memory
– names : variable names are included on the first line of the file
• Example
cd "…Datafor stata training manual_EEA"
clear
insheet using ERHS_SPSS.csv, comma
Good Sites to Look At!
• STATA HELP – either online or in the software itself.
• http://stataproject.blogspot.com.
• http://www.stata.com/
• http://www.stata.com/statalist/
• http://ideas.repec.org/s/boc/bocode.html
• http://www.princeton.edu/~erp/stata/main.html
• http://www.cpc.unc.edu/services/computer/prese
ntations/statatutorial/
• http://www.ats.ucla.edu/stat/stata/
Good Sites to Look At!
• Statalist is hosted at the Harvard School of Public
Health and is an email listserver where Stata users,
including experts writing Stata programs, maintain a
lively dialogue about all things statistical and Stata.
• You can sign on to Statalist so that you can receive
as well as post your own questions through email.
exit, clear

Contenu connexe

Tendances

Tendances (20)

Introduction to Stata
Introduction to StataIntroduction to Stata
Introduction to Stata
 
Introduction to Generalized Linear Models
Introduction to Generalized Linear ModelsIntroduction to Generalized Linear Models
Introduction to Generalized Linear Models
 
Stata tutorial
Stata tutorialStata tutorial
Stata tutorial
 
SURVIVAL ANALYSIS.ppt
SURVIVAL ANALYSIS.pptSURVIVAL ANALYSIS.ppt
SURVIVAL ANALYSIS.ppt
 
Survival analysis
Survival analysisSurvival analysis
Survival analysis
 
Introduction To Statistics
Introduction To StatisticsIntroduction To Statistics
Introduction To Statistics
 
Survival analysis
Survival  analysisSurvival  analysis
Survival analysis
 
Categorical data analysis
Categorical data analysisCategorical data analysis
Categorical data analysis
 
Statistics in research
Statistics in researchStatistics in research
Statistics in research
 
What Is the Use of SPSS in Data Analysis
What Is the Use of SPSS in Data AnalysisWhat Is the Use of SPSS in Data Analysis
What Is the Use of SPSS in Data Analysis
 
STATA - Time Series Analysis
STATA - Time Series AnalysisSTATA - Time Series Analysis
STATA - Time Series Analysis
 
Basics of SPSS, Part 2
Basics of SPSS, Part 2Basics of SPSS, Part 2
Basics of SPSS, Part 2
 
Stata statistics
Stata statisticsStata statistics
Stata statistics
 
Data management through spss
Data management through spssData management through spss
Data management through spss
 
Case control study
Case control studyCase control study
Case control study
 
4a. sampling
4a. sampling4a. sampling
4a. sampling
 
Time series analysis in Stata
Time series analysis in StataTime series analysis in Stata
Time series analysis in Stata
 
Ordinal logistic regression
Ordinal logistic regression Ordinal logistic regression
Ordinal logistic regression
 
Point estimation
Point estimationPoint estimation
Point estimation
 
Lecture 6. univariate and bivariate analysis
Lecture 6. univariate and bivariate analysisLecture 6. univariate and bivariate analysis
Lecture 6. univariate and bivariate analysis
 

Similaire à Stata Training_EEA.ppt

Introduction to STATA(2).pdf
Introduction to STATA(2).pdfIntroduction to STATA(2).pdf
Introduction to STATA(2).pdfYomif3
 
STATA_Training_for_data_science_juniors.pdf
STATA_Training_for_data_science_juniors.pdfSTATA_Training_for_data_science_juniors.pdf
STATA_Training_for_data_science_juniors.pdfAronMozart1
 
An introduction to STATA.pdf
An introduction to STATA.pdfAn introduction to STATA.pdf
An introduction to STATA.pdfMd Nain
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2PoguttuezhiniVP
 
AARAV NAYAN OPERATING SYSTEM LABORATORY PCA
AARAV NAYAN OPERATING SYSTEM LABORATORY PCAAARAV NAYAN OPERATING SYSTEM LABORATORY PCA
AARAV NAYAN OPERATING SYSTEM LABORATORY PCAAaravNayan
 
Bozorgmeh os lab
Bozorgmeh os labBozorgmeh os lab
Bozorgmeh os labFS Karimi
 
Dynamics ax performance tuning
Dynamics ax performance tuningDynamics ax performance tuning
Dynamics ax performance tuningOutsourceAX
 
MS SQL Server.ppt
MS SQL Server.pptMS SQL Server.ppt
MS SQL Server.pptQuyVo27
 
CMake Tutorial
CMake TutorialCMake Tutorial
CMake TutorialFu Haiping
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slidesmetsarin
 
Inspection and maintenance tools (Linux / OpenStack)
Inspection and maintenance tools (Linux / OpenStack)Inspection and maintenance tools (Linux / OpenStack)
Inspection and maintenance tools (Linux / OpenStack)Gerard Braad
 

Similaire à Stata Training_EEA.ppt (20)

Introduction to STATA(2).pdf
Introduction to STATA(2).pdfIntroduction to STATA(2).pdf
Introduction to STATA(2).pdf
 
introduction-stata.pptx
introduction-stata.pptxintroduction-stata.pptx
introduction-stata.pptx
 
STATA_Training_for_data_science_juniors.pdf
STATA_Training_for_data_science_juniors.pdfSTATA_Training_for_data_science_juniors.pdf
STATA_Training_for_data_science_juniors.pdf
 
Linux
LinuxLinux
Linux
 
StataTutorial.pdf
StataTutorial.pdfStataTutorial.pdf
StataTutorial.pdf
 
An introduction to STATA.pdf
An introduction to STATA.pdfAn introduction to STATA.pdf
An introduction to STATA.pdf
 
Stata tutorial university of princeton
Stata tutorial university of princetonStata tutorial university of princeton
Stata tutorial university of princeton
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
 
AARAV NAYAN OPERATING SYSTEM LABORATORY PCA
AARAV NAYAN OPERATING SYSTEM LABORATORY PCAAARAV NAYAN OPERATING SYSTEM LABORATORY PCA
AARAV NAYAN OPERATING SYSTEM LABORATORY PCA
 
Bozorgmeh os lab
Bozorgmeh os labBozorgmeh os lab
Bozorgmeh os lab
 
Sas - Introduction to working under change management
Sas - Introduction to working under change managementSas - Introduction to working under change management
Sas - Introduction to working under change management
 
Dynamics ax performance tuning
Dynamics ax performance tuningDynamics ax performance tuning
Dynamics ax performance tuning
 
Linux
LinuxLinux
Linux
 
Group13
Group13Group13
Group13
 
MS SQL Server.ppt
MS SQL Server.pptMS SQL Server.ppt
MS SQL Server.ppt
 
Basics.ppt
Basics.pptBasics.ppt
Basics.ppt
 
CMake Tutorial
CMake TutorialCMake Tutorial
CMake Tutorial
 
Aggregate.pptx
Aggregate.pptxAggregate.pptx
Aggregate.pptx
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
 
Inspection and maintenance tools (Linux / OpenStack)
Inspection and maintenance tools (Linux / OpenStack)Inspection and maintenance tools (Linux / OpenStack)
Inspection and maintenance tools (Linux / OpenStack)
 

Plus de selam49

Yom_DATA MANAGEMENT.ppt
Yom_DATA MANAGEMENT.pptYom_DATA MANAGEMENT.ppt
Yom_DATA MANAGEMENT.pptselam49
 
Performance appraisal CH.7.ppt
Performance appraisal CH.7.pptPerformance appraisal CH.7.ppt
Performance appraisal CH.7.pptselam49
 
Integration and maintenance chapter-9.ppt
Integration and maintenance chapter-9.pptIntegration and maintenance chapter-9.ppt
Integration and maintenance chapter-9.pptselam49
 
Compensation ch 8.ppt
Compensation ch 8.pptCompensation ch 8.ppt
Compensation ch 8.pptselam49
 
CH-1&2-introduction-of-hrm...ppt
CH-1&2-introduction-of-hrm...pptCH-1&2-introduction-of-hrm...ppt
CH-1&2-introduction-of-hrm...pptselam49
 
MBA UNIT VII PROJECT FINANCE-1.doc
MBA UNIT VII PROJECT FINANCE-1.docMBA UNIT VII PROJECT FINANCE-1.doc
MBA UNIT VII PROJECT FINANCE-1.docselam49
 
CONTENT 3.doc
CONTENT 3.docCONTENT 3.doc
CONTENT 3.docselam49
 
CONTENT 2.doc
CONTENT 2.docCONTENT 2.doc
CONTENT 2.docselam49
 
MBA UNIT 7 PROJECT FINANCE.doc
MBA UNIT 7 PROJECT FINANCE.docMBA UNIT 7 PROJECT FINANCE.doc
MBA UNIT 7 PROJECT FINANCE.docselam49
 
MBA UNIT 6.pptx
MBA UNIT 6.pptxMBA UNIT 6.pptx
MBA UNIT 6.pptxselam49
 
MBA UNIT 5.pptx
MBA UNIT 5.pptxMBA UNIT 5.pptx
MBA UNIT 5.pptxselam49
 
MBA UNIT 4.pptx
MBA UNIT 4.pptxMBA UNIT 4.pptx
MBA UNIT 4.pptxselam49
 
MBA UNIT 3.pptx
MBA UNIT 3.pptxMBA UNIT 3.pptx
MBA UNIT 3.pptxselam49
 
MBA UNIT 2.pptx
MBA UNIT 2.pptxMBA UNIT 2.pptx
MBA UNIT 2.pptxselam49
 
MBA UNIT 1.pptx
MBA UNIT 1.pptxMBA UNIT 1.pptx
MBA UNIT 1.pptxselam49
 
Chapter 3 Final.ppt
Chapter 3 Final.pptChapter 3 Final.ppt
Chapter 3 Final.pptselam49
 
Chapter 2 Final.ppt
Chapter 2 Final.pptChapter 2 Final.ppt
Chapter 2 Final.pptselam49
 
Chapter 1 Final.ppt
Chapter 1 Final.pptChapter 1 Final.ppt
Chapter 1 Final.pptselam49
 
Leadership for RRS - 2014.pdf
Leadership for RRS - 2014.pdfLeadership for RRS - 2014.pdf
Leadership for RRS - 2014.pdfselam49
 
BSC for Refugees and Returnees Service Ginbot 2014.pptx
BSC for Refugees and Returnees Service Ginbot 2014.pptxBSC for Refugees and Returnees Service Ginbot 2014.pptx
BSC for Refugees and Returnees Service Ginbot 2014.pptxselam49
 

Plus de selam49 (20)

Yom_DATA MANAGEMENT.ppt
Yom_DATA MANAGEMENT.pptYom_DATA MANAGEMENT.ppt
Yom_DATA MANAGEMENT.ppt
 
Performance appraisal CH.7.ppt
Performance appraisal CH.7.pptPerformance appraisal CH.7.ppt
Performance appraisal CH.7.ppt
 
Integration and maintenance chapter-9.ppt
Integration and maintenance chapter-9.pptIntegration and maintenance chapter-9.ppt
Integration and maintenance chapter-9.ppt
 
Compensation ch 8.ppt
Compensation ch 8.pptCompensation ch 8.ppt
Compensation ch 8.ppt
 
CH-1&2-introduction-of-hrm...ppt
CH-1&2-introduction-of-hrm...pptCH-1&2-introduction-of-hrm...ppt
CH-1&2-introduction-of-hrm...ppt
 
MBA UNIT VII PROJECT FINANCE-1.doc
MBA UNIT VII PROJECT FINANCE-1.docMBA UNIT VII PROJECT FINANCE-1.doc
MBA UNIT VII PROJECT FINANCE-1.doc
 
CONTENT 3.doc
CONTENT 3.docCONTENT 3.doc
CONTENT 3.doc
 
CONTENT 2.doc
CONTENT 2.docCONTENT 2.doc
CONTENT 2.doc
 
MBA UNIT 7 PROJECT FINANCE.doc
MBA UNIT 7 PROJECT FINANCE.docMBA UNIT 7 PROJECT FINANCE.doc
MBA UNIT 7 PROJECT FINANCE.doc
 
MBA UNIT 6.pptx
MBA UNIT 6.pptxMBA UNIT 6.pptx
MBA UNIT 6.pptx
 
MBA UNIT 5.pptx
MBA UNIT 5.pptxMBA UNIT 5.pptx
MBA UNIT 5.pptx
 
MBA UNIT 4.pptx
MBA UNIT 4.pptxMBA UNIT 4.pptx
MBA UNIT 4.pptx
 
MBA UNIT 3.pptx
MBA UNIT 3.pptxMBA UNIT 3.pptx
MBA UNIT 3.pptx
 
MBA UNIT 2.pptx
MBA UNIT 2.pptxMBA UNIT 2.pptx
MBA UNIT 2.pptx
 
MBA UNIT 1.pptx
MBA UNIT 1.pptxMBA UNIT 1.pptx
MBA UNIT 1.pptx
 
Chapter 3 Final.ppt
Chapter 3 Final.pptChapter 3 Final.ppt
Chapter 3 Final.ppt
 
Chapter 2 Final.ppt
Chapter 2 Final.pptChapter 2 Final.ppt
Chapter 2 Final.ppt
 
Chapter 1 Final.ppt
Chapter 1 Final.pptChapter 1 Final.ppt
Chapter 1 Final.ppt
 
Leadership for RRS - 2014.pdf
Leadership for RRS - 2014.pdfLeadership for RRS - 2014.pdf
Leadership for RRS - 2014.pdf
 
BSC for Refugees and Returnees Service Ginbot 2014.pptx
BSC for Refugees and Returnees Service Ginbot 2014.pptxBSC for Refugees and Returnees Service Ginbot 2014.pptx
BSC for Refugees and Returnees Service Ginbot 2014.pptx
 

Dernier

(中央兰开夏大学毕业证学位证成绩单-案例)
(中央兰开夏大学毕业证学位证成绩单-案例)(中央兰开夏大学毕业证学位证成绩单-案例)
(中央兰开夏大学毕业证学位证成绩单-案例)twfkn8xj
 
Lundin Gold April 2024 Corporate Presentation v4.pdf
Lundin Gold April 2024 Corporate Presentation v4.pdfLundin Gold April 2024 Corporate Presentation v4.pdf
Lundin Gold April 2024 Corporate Presentation v4.pdfAdnet Communications
 
Call Girls Near Golden Tulip Essential Hotel, New Delhi 9873777170
Call Girls Near Golden Tulip Essential Hotel, New Delhi 9873777170Call Girls Near Golden Tulip Essential Hotel, New Delhi 9873777170
Call Girls Near Golden Tulip Essential Hotel, New Delhi 9873777170Sonam Pathan
 
Tenets of Physiocracy History of Economic
Tenets of Physiocracy History of EconomicTenets of Physiocracy History of Economic
Tenets of Physiocracy History of Economiccinemoviesu
 
Stock Market Brief Deck for 4/24/24 .pdf
Stock Market Brief Deck for 4/24/24 .pdfStock Market Brief Deck for 4/24/24 .pdf
Stock Market Brief Deck for 4/24/24 .pdfMichael Silva
 
Stock Market Brief Deck for "this does not happen often".pdf
Stock Market Brief Deck for "this does not happen often".pdfStock Market Brief Deck for "this does not happen often".pdf
Stock Market Brief Deck for "this does not happen often".pdfMichael Silva
 
BPPG response - Options for Defined Benefit schemes - 19Apr24.pdf
BPPG response - Options for Defined Benefit schemes - 19Apr24.pdfBPPG response - Options for Defined Benefit schemes - 19Apr24.pdf
BPPG response - Options for Defined Benefit schemes - 19Apr24.pdfHenry Tapper
 
The Core Functions of the Bangko Sentral ng Pilipinas
The Core Functions of the Bangko Sentral ng PilipinasThe Core Functions of the Bangko Sentral ng Pilipinas
The Core Functions of the Bangko Sentral ng PilipinasCherylouCamus
 
Bladex 1Q24 Earning Results Presentation
Bladex 1Q24 Earning Results PresentationBladex 1Q24 Earning Results Presentation
Bladex 1Q24 Earning Results PresentationBladex
 
Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]Monthly Market Risk Update: April 2024 [SlideShare]
Monthly Market Risk Update: April 2024 [SlideShare]Commonwealth
 
How Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingHow Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingAggregage
 
Current Economic situation of Pakistan .pptx
Current Economic situation of Pakistan .pptxCurrent Economic situation of Pakistan .pptx
Current Economic situation of Pakistan .pptxuzma244191
 
letter-from-the-chair-to-the-fca-relating-to-british-steel-pensions-scheme-15...
letter-from-the-chair-to-the-fca-relating-to-british-steel-pensions-scheme-15...letter-from-the-chair-to-the-fca-relating-to-british-steel-pensions-scheme-15...
letter-from-the-chair-to-the-fca-relating-to-british-steel-pensions-scheme-15...Henry Tapper
 
Call Girls In Yusuf Sarai Women Seeking Men 9654467111
Call Girls In Yusuf Sarai Women Seeking Men 9654467111Call Girls In Yusuf Sarai Women Seeking Men 9654467111
Stata Training_EEA.ppt

  • 19. Storing Commands and Output • log off – This command temporarily turns off the logging of output • log on – This command restarts the logging • log close – This command turns off the logging and saves the log file.
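A minimal sketch of how these logging commands fit together in a Do-file (the log file name mylog and the commands in between are only illustrations, not from the original slides):
  log using mylog, replace text     // start a new text log called mylog.log
  summarize cons hhsize             // this output is captured in the log
  log off                           // pause logging
  list hhid cons in 1/5             // this output is NOT captured
  log on                            // resume logging
  tabulate q1a                      // captured again
  log close                         // stop logging and save mylog.log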
  • 20. Storing Commands and Output • set logtype text – This command tells Stata to always save the log files in text (ASCII) format • set logtype smcl – This command tells Stata to always save log files in SMCL format.
  • 21. Examining dataset • clear – The clear command removes the data, variables, and labels from memory to get ready to use a new data file – You can clear memory using the clear command on its own or by using clear as part of the use command – This command does not delete any data saved to the hard drive
  • 22. Examining dataset • set memory – First you can check how much memory is allocated to hold your data using the memory command – By default we have 11MB free for reading in a data file – Whenever we try to read a data file bigger than this free memory, we will get the error message: no room to add more observations r(901);
  • 23. Examining dataset – In this case we have to allocate more memory, say 25MB (if 25MB is sufficient for the current file), with the set memory command before trying to use our file. set memory 25m – Now that we have allocated enough memory, we will be able to read bigger files, provided they fit within the allocated memory – If we want to allocate 25m (25 megabytes) every time we start Stata, we can type: set memory 25m, permanently
  • 24. Examining dataset • use – This command opens an existing Stata data file. • The syntax is: use filename [, clear ] opens the file ‘filename’ use [varlist] [if exp] [in range] using filename [, clear ] opens selected parts of file – If there is no path, Stata assumes it is in the current folder. – You can use a path name such as: use C:\...\ERHScons1999 – If the path name has spaces, you must use double quotes: use "d:\my data\ERHScons1999"
  • 25. • Logical operators used in Stata ~ Not == Equal ~= not equal != not equal > greater than >= greater than or equal < less than <= less than or equal & And | Or Examining dataset
  • 26. Examining dataset Here are some examples of the use command: • use ERHScons1999 opens the file ERHScons1999.dta for analysis. • use ERHScons1999 if q1a == 1 opens data from region 1 • use ERHScons1999 in 5/25 opens records 5 through 25 of file • use hhsize cons using ERHScons1999 opens two variables from the ERHScons1999 file • use C:\training\ERHScons1999 opens the file ERHScons1999.dta in the specified folder • use "$mydata\ERHScons1999" use quotation marks if there are spaces • use ERHScons1999, clear clears memory before opening the new file
  • 27. Examining dataset • save – The save command will save the dataset as a .dta file under the name you choose. Open a subset of a dataset (for region 1 = Tigray only) use erhscons1999 if q1a==1, clear Save this data as a new file with the name tigray save tigray, replace • The replace option allows you to save a changed file to the disk, replacing the original file. Stata is worried that you will accidentally overwrite your data file. You need to use the replace option to tell Stata that you know that the file exists and you want to replace it.
  • 28. Examining dataset • Open the training dataset use ERHScons1999, clear • edit – This command opens the Data Editor window, which allows us to view observations as a spreadsheet – You can change the data using the Data Editor window, but it is not recommended to edit data this way – It is better to correct errors in the data using a Do-file program that can be saved
  • 29. • browse – This window is exactly like the Data Editor window, except that you cannot change the data in this case • describe – This command provides a brief description of the data file. You can use "des" or "d" as shorthand for describe. – The output includes: • the number of variables • the number of observations (records) • the size of the file • the list of variables and their characteristics Examining dataset
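For example, assuming the training dataset is already in memory, the following calls are all valid (the variable list is only an illustration):
  browse hhid q1a cons        // view selected variables without any risk of editing them
  describe                    // describe every variable in the dataset
  describe hhsize cons        // describe selected variables only
  d, short                    // abbreviation of describe; short shows only the header information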
  • 31. Examining dataset • list – This command lists values of variables in the dataset. The syntax is: list [varlist] [if exp] [in range] • examples: – list lists the entire dataset – list in 1/10 lists observations 1 through 10 – list hhsize q1a food lists selected variables – list hhsize sex in 1/20 lists observations 1-20 for selected variables – list if q1a < 6 lists cases where region is 1 through 5
  • 32. Examining dataset • if – This qualifier is used to select certain records when carrying out a command: command if exp Examples: – list hhid q1a food if food >1200 lists data if food is above 1200 – tab q1a if cons>1000 & cons<2000 frequency table of region if consumption is in range – summarize food if q1a==3 | q1a==4 statistics on food consumption for regions 3 and 4 – browse hhid q1a food if food >=1200 browse data if food consumption is above 1200 • Note that "if" statements always use ==, not a single =
  • 33. Examining dataset • in – We have also used in to select records based on the case number. – The syntax is: command in exp For example: • list in 10 list observation number 10 • summarize in 10/20 summarize observations 10-20 • l in -10/-1 list the last 10 observations
  • 34. Examining dataset • codebook – The codebook command is a great tool for getting a quick overview of the variables in the data file. – It produces a kind of electronic codebook from the data file, displaying information about variables' names, labels and values . codebook sexh Sex of household head ---------------------------------------------------------------------------- type: numeric (byte) label: sexhh range: [0,1] units: 1 unique values: 2 missing .: 0/1452 tabulation: Freq. Numeric Label 400 0 Female 1052 1 Male
  • 35. Examining dataset • inspect – It is another useful command for getting a quick overview of a data file. – inspect command displays information about the values of variables and is useful for checking data accuracy . inspect sexh sexh: Sex of household head Number of Observations ---------------------------- Non- Total Integers Integers | # Negative - - - | # Zero 400 400 - | # Positive 1052 1052 - | # ----- ----- ----- | # # Total 1452 1452 - | # # Missing - +---------------------- ----- 0 1 1452 (2 unique values) sexh is labeled and all values are documented in the label.
  • 36. Examining dataset • count – The count command shows the number of observations that satisfy the if condition. If no condition is specified, count displays the number of observations in the data. count 1452 count if q1a==3 466
  • 37. Examining dataset Common Stata Syntax • Stata commands follow the syntax: [by varlist1:] command [varlist2] [if exp] [in range] [weight], [options] • Items inside the square brackets are optional, and not every element is available for every command. • This syntax applies to all Stata commands, as the example below illustrates
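A hedged example that uses most pieces of this general syntax at once, with the variable names used elsewhere in this training dataset:
  * by-group prefix, variable list, if qualifier, and an option in a single command
  bysort sexh: summarize cons food if q1a==3, detail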
  • 38. Descriptive Statistics • tabulate, tab1, tab2 –These are three related commands that produce frequency tables for discrete variables. –They can produce one-way or two-way frequency tables
  • 39. Descriptive Statistics • tabulate or tab produce a frequency table for one or two variables • tab1 produces a one-way frequency table for each variable in the variable list • tab2 produces all possible two- variable tables from the list of variables
  • 40. Descriptive Statistics You can use several options with these commands: • all gives all the tests of association for two-way tables • cell gives the overall percentage for two-way tables • column gives column percentages for two-way tables • row gives row percentages for two-way tables • nofreq suppresses printing the frequencies. • chi2 provides the chi squared test for two-way tables
  • 41. Descriptive Statistics Some examples of the tabulate commands are: • tabulate q1a produces table of frequency by region • tabulate q1a sexh produces a cross-tab of frequencies by region and sex of head • tabulate q1a hhsize, row produces a cross-tab by region and hhsize with row percentages • tabulate sexh hhsize, cell nofreq produces a cross-tab of overall percent by sex and hhsize. • tab1 q1a q1b hhsize produces three tables, a frequency table for each variable • tab2 q1a poor sexh produces three tables, a cross- tab of each pair of variables
  • 42. Descriptive Statistics • summarize – The summarize command produces statistics on continuous variables like age, food, cons, and hhsize. The syntax looks like this: summarize [varlist] [if exp] [in range] [, detail] By default, it produces the following statistics: • Number of observations • Average (or mean) • Standard deviation • Minimum • Maximum If you specify "detail" Stata gives you additional statistics, such as • skewness, • kurtosis, • the four smallest values • the four largest values • various percentiles.
  • 43. Descriptive Statistics • Here are some examples: • summarize gives statistics on all variables • summarize hhsize food gives statistics on selected variables • summarize hhsize cons if q1a==3 gives statistics on two variables for one region
  • 44. Descriptive Statistics • bysort – This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The general syntax is: bysort varlist: command • Example: – bysort sex: sum rconsae for sex of hh head, give stats on real per capita consumption
  • 45. Descriptive Statistics • help – The help command gives you information about any Stata command or topic help [command] For example, • help tabulate gives a description of the tabulate command • help summarize gives a description of the summarize command
  • 46. Creating New Variables • We have seen how to explore the data using existing variables so far. • Now we will discuss how to create new variables. • When new variables are created, they are in memory and they will appear in the Data Browser, – but they will not be saved on the hard-disk unless you use the save command
  • 47. Creating New Variables • generate – This command is used to create a new variable. It is similar to "compute" in SPSS. • The syntax is: generate newvar = exp [if exp] where "exp" is an expression like "food/hhsize" or "20*cons"
  • 48. Creating New Variables • The command cannot be used to modify an existing variable • You can use "gen" or "g" as an abbreviation for "generate" • If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true • If you use "if", the new variable will have missing values when the "if" statement is false
  • 49. Creating New Variables • For example, – gen age2 = age*age • create age squared variable – gen yield = outputkg/area if area>0 • create new yield variable if area is positive – gen price = value/quant if quant>0 • create new price variable if quant is positive – gen smhh= (hhsize<4) • creates a dummy variable equal to 1 for smaller households (less than 4 members)
  • 50. • replace – This command is used to change the definition of an existing variable. • The syntax is the same: replace oldvar = exp [if exp] [in exp] Creating New Variables
  • 51. Creating New Variables • For example, replace cons=. if cons<0 replaces negative consumption with missing value replace price = avgprice if price > 100000 replaces high values with an average price replace age = 25 in 1007 replace age=25 in observation #1007
  • 52. Creating New Variables • tabulate … generate – This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable. • The syntax is: tabulate oldvariable, generate(newvariable)
  • 53. abs(x) computes the absolute value of x exp(x) calculates e to the x power. ln(x) computes the natural logarithm of x log(x) is a synonym for ln(x), the natural logarithm. log10(x) computes the log base 10 of x. sqrt(x) computes the square root of x. invnorm(p) provides the inverse cumulative normal; invnorm(norm(z)) = z. normden(z) provides the standard normal density. normden(z,s) provides the normal density. normden(z,s) = normden(z)/s if s>0 and s not missing, otherwise, the result is missing. norm(z) provides the cumulative standard normal. group(x) creates a categorical variable that divides the data into x as nearly equal- sized subsamples as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data. int(x) gives the integer obtained by truncating x. round(x,y) gives x rounded into units of y.
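A few of these functions in use, as a minimal sketch (the new variable names lncons, sqrthh, agegrp and roundcons are arbitrary):
  gen lncons    = ln(cons)          // natural log of consumption
  gen sqrthh    = sqrt(hhsize)      // square root of household size
  gen agegrp    = int(ageh/10)      // truncate age/10 to an integer
  gen roundcons = round(cons,100)   // consumption rounded to the nearest 100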
  • 54. Creating New Variables tab q1a, gen(region) • This creates 6 new variables: region1=1 if q1a=1 and 0 otherwise region2 =1 if q1a =3 and 0 otherwise …… region6=1 if q1a =8 and 0 otherwise
  • 55. Creating New Variables • egen – This is an extended version of “generate”[extended generate] to create a new variable by aggregating the existing data. • The syntax is: egen newvar = fcn(arguments) [if exp] [in range] , by(var)
  • 56. Creating New Variables • count() number of non-missing values • diff() compares variables, 1 if different, 0 otherwise • fill() fill with a pattern • group() creates a group id from a list of variables • iqr() interquartile range • ma() moving average • max() maximum value • mean() mean • median() median • min() minimum value • pctile() percentile • rank () rank • rmean() mean across variables • sd () standard deviation • std() standardize variables • sum () sums
  • 57. Creating New Variables • egen avg = mean(cons) creates variable of average consumption over entire sample • egen avg2 = median(cons), by(sex) creates variable of median consumption for each sex • egen regprod = sum(cons), by(q1a) creates variable of total consumption for each region
  • 58. Creating New Variables • Exercise, • we want to know which households have expenditure (cons) above the village average. • I.e. Create a dummy (1 for those who consume above the village/peasant association average and 0 otherwise)
  • 59. Creating New Variables • egen avecon=mean(cons), by( q1c) • gen highavecon=(cons> avecon & cons!=.) • list hhid q1c cons avecon highavecon in 650/675
  • 60. Creating New Variables • Arithmetic + addition - subtraction * multiplication / division ^ power • Logical ~ not | or & and • Relational > greater than < less than >= more than or equal <= less than or equal == equal ~= not equal != not equal
  • 61. Creating New Variables • Here are some examples to illustrate the use of these operators. Suppose you want to create a – dummy variable indicating households in the Amhara region. – One way to do it is to run: generate AmD = 0 replace AmD = 1 if q1a==3 – Or you can get exactly the same result with just: generate AmD2 = (q1a==3) compare AmD AmD2
  • 62. Creating New Variables • For example, generate a dummy that identifies observations with female household heads in Dodota wereda. gen DDfemale = 0 replace DDfemale = 1 if q1b==9 & sexh==0 or an easier way to do this would be: gen DDfemale2 = (q1b==9 & sexh==0)
  • 63. Creating New Variables • recode – This command changes the values of a categorical variable according to the rules specified. • The syntax is: recode varname old#=new# old#=new# [if exp] [in range]
  • 64. Creating New Variables • Notice that you can use some special symbols in the rules: * means all other values . means missing values x/y means all values from x to y x y means x and y • For example, recode q1a 8 9=7 recodes region values 8 and 9 to 7
  • 65. Creating New Variables • Here are some examples: • recode x 1=2 changes all values of x=1 to x= 2 • recode x 1=2 3=4 changes 1 to 2 and 3 to 4 • recode x 1=2 2=1 exchanges the values 1 and 2 in x • recode x 1=2 *=3 changes 1 in x to 2 and all other values to 3 • recode x 1/5=2 changes 1 through 5 in x to 2 • recode x 1 3 4 5 = 6 changes 1, 3, 4 and 5 to 6 • recode x .=9 changes missing to 9 • recode x 9=. changes 9 to missing
  • 66. Creating New Variables • xtile – This command creates a new variable that indicates which category a record falls into, when the sample is sorted by an existing variable and divided into “n” groups of equal size. • Example: xtile can be used to create a variable that indicates which income quintile a household belongs to
  • 67. Creating New Variables • The syntax is: xtile newvar = variable [if exp] [in range] , nq(#) – where newvar is the new categorical variable created; variable is the existing variable used to create the quantile (e.g income, farm size); # is the number of different categories (eg 5 for quintiles, 3 for terciles)
  • 68. Creating New Variables • For example, xtile consq = cons, nq(5) xtile rconsq = rconsae, nq(10)
  • 69. Modifying Variables • We begin with an explanation of how to label data in Stata. Then see how to format variables. – rename variable – label variable – label define – label values – format variable
  • 70. Modifying Variables • rename – This command is used to give an existing variable a new name. – The syntax is rename old_variable new_variable • Example: generate dummies for the region variable and rename the new dummy variables accordingly, as sketched below
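A sketch of that example, using the dummies created by tabulate …, generate(); the new names are only illustrations, and region1 to region3 are assumed to correspond to Tigray, Amhara and Oromia in this dataset:
  tab q1a, gen(region)       // creates region1, region2, ... as shown earlier
  rename region1 tigray
  rename region2 amhara
  rename region3 oromia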
  • 71. Modifying Variables • The same set of region value labels can be attached to several variables (for example, region of current residence, region of place of birth, and region migrated to): label define saves the label definitions, and label values attaches the defined labels to a variable
  • 72. Modifying Variables • label define – This command gives a name to a set of value labels. For example, instead of numbering the regions, we can assign a label to each region. • The syntax is: label define lblname # "label" # "label" # "label" [, add modify]
  • 73. Modifying Variables • label values – This command attaches named set of value labels to a categorical variable. • The syntax is: label values varname [lblname] [, nofix]
  • 74. Modifying Variables label define reg 1"Tigray" 3"Amhara" 4"Oromia" 7"SNNP", modify label values q1a reg • Some additional commands that may be useful in labeling – label dir to request a list of existing label names – label list to request a list of all the existing value labels – label drop to delete one or more labels – label save using to save label definitions as a Do-file – label data to give a label to a data file
  • 75. • tabulate … summarize – This command creates one- and two-way tables that summarize continuous variables. With the summarize() option, each cell can show the mean and other statistics of a continuous variable. • The syntax is: tabulate varname1 varname2 [if exp] [in range], summarize(varname3) options • You can specify which statistics to show with the options "means", "standard" and "freq" Advanced Descriptive Statistics
  • 76. • Some examples: • tab q1a, sum(cons) gives the mean, std deviation, and frequency of per capita expenditure for each region • tab q1b, sum(cons) mean gives the mean consumption for each village • tab q1a sexh, sum(food) gives the mean, std deviation, and frequency in each cell of hh head sex per region Advanced Descriptive Statistics
  • 77. Advanced Descriptive Statistics • tabstat – This command gives summary statistics for a set of continuous variable for each value of a categorical variable. • The syntax is: tabstat varlist [if exp] [in range] , stat(statname [...]) by(varname) • where – varlist is a list of continuous variables – statname is a type of statistic – varname is a categorical variable
  • 78. Some facts about this command: – The default statistic is the mean. – Optional statistics subcommands include mean, sum, max, min, range, sd (standard deviation), var (variance), skewness, kurtosis, median, and pn (nth percentile). – Without the by() option, tabstat is like “summarize” except that it allows you to specify the list of statistics to be displayed. – With the by() option, tabstat is like "tabulate … summarize “except that tabstat is more flexible in the statistics and format Advanced Descriptive Statistics
  • 79. • Examples – tabstat food hhsize, stats(mean max min) gives mean, max, and min of food & hhsize – tabstat food hhsize, by(q1a) gives mean of two variables for each region – tabstat food, stats(median) by(q1a) gives the median food consumption for each region • The tabstat command displays summary statistics for a series of numeric variables in a single table. Advanced Descriptive Statistics
  • 80. • table – This command creates a wide variety of tables. It is probably the most flexible and useful of all the table commands in Stata. • The syntax is: table rowvar colvar [if exp] [in range], c(clist) [row col] • where – rowvar is the categorical row variable – colvar is the categorical column variable – clist is a list of statistic and variables – row is an option to include a summary row – col is an option to include a summary column Advanced Descriptive Statistics
  • 81. • Some useful facts about this command: – The default statistic is the frequency. – Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min, median, and pn (nth percentile). – The c( ) is short for contents of each cell. – Like tab, it can be used to create one- and two- way frequency tables, but table cannot do percentages Advanced Descriptive Statistics
  • 82. Advanced Descriptive Statistics • Useful facts (cont.) : – Like tab…sum, it can be used to calculate basic stats for each value of a categorical variable – Its advantage over tab…sum is that it can do more statistics and it can take more than one continuous variable – Like tabstat, it can be used to calculate advanced stats for each value of a categorical variable – Its advantage over tabstat is that it can use two (and more) way tables, but its disadvantage is that it has fewer statistics.
  • 83. • Here are some examples: – table q1a , row table of frequencies by region with total row – table q1a, c(mean cons) table of average consumption by region – table q1a, c(mean food sd food median food) table of food Consumption statistics by region – table q1a, c(mean cons) format(%9.2f) table of average consumption by region with format . – table q1a sexh, c(mean cons) table of average consumption by region and sex – table q1a sexh, c(mean cons mean food) table of avg consumption & food consumption by region & sex Advanced Descriptive Statistics
  • 84. Presenting Data with Graph • The commands that draw graphs are: graph twoway – scatterplots and line plots; graph matrix – scatterplot matrices; graph bar – bar charts; graph dot – dot charts; graph box – box-and-whisker plots; graph pie – pie charts
  • 85. Presenting Data with Graph • Examples graph twoway scatter cons food • We can show the regression line predicting cons from food using lfit option. twoway lfit cons food • The two graphs can be overlapped like this twoway (scatter cons hhsize) (lfit cons hhsize) twoway (scatter cons food) (lfit cons food)
  • 86. Presenting Data with Graph • Labeling graphs scatter var1 var2, title("title") subtitle("subtitle") xtitle("xtitle") ytitle("ytitle") note("note") • Example scatter ageh cons , title("title") subtitle("subtitle") xtitle("xtitle") ytitle("ytitle") note("note")
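The other graph types listed earlier follow the same pattern. A hedged sketch (not from the original slides) using this dataset's variables:
  graph bar (mean) cons, over(q1a) title("Mean consumption by region")
  graph box cons, over(sexh)       // box plots of consumption by sex of head
  graph pie, over(q1a)             // share of households by region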
  • 87. Normality and Outlier • skewness and kurtosis sum rconspc sum rconspc, detail • check normality of a variable visually by looking at some basic graphs histogram rconspc histogram rconspc, normal histogram rconspc, normal bin(100)
  • 89. Normality and Outlier • graph box draws vertical box plots graph box rconspc, by(sexh) – the y axis is numerical, and the x axis is categorical – the upper and lower bounds of the box are defined by the 25th and 75th percentiles – the line within the box is the median – the ends of the whiskers mark the upper and lower adjacent values (the most extreme observations within 1.5 times the interquartile range of the box) • If rconspc is normal, the median would be in the center of the box and the ends of the whiskers would be roughly equidistant from the box
  • 90. Normality and Outlier • The kdensity command with the normal option kdensity rconspc, normal – density plot of the variable with a normal distribution superimposed on the graph – also useful, after a regression, in verifying that the residuals are normally distributed • pnorm command produces a P-P plot pnorm rconspc – It should be approximately linear if the variable follows a normal distribution
  • 91. Normality and Outlier • The qnorm command plots the quantiles of a variable against the quantiles of a normal distribution qnorm rconspc – If the Q-Q plot shows a line that is close to the 45 degree line, the variable is approximately normally distributed • Both the P-P and Q-Q plots show that rconspc is not normal, with a long tail to the right • The qnorm plot is more sensitive to deviations from normality in the tails of the distribution • The pnorm plot is more sensitive to deviations near the mean of the distribution
  • 92. Normality and Outlier • Dealing with outliers – We have the following options when we have outliers • delete them from analyses • use measures that are not sensitive to them, such as median instead of mean • transform the data to be more normal • to replace them by imputation
  • 93. Normality and Outlier /* Calculate number of standard deviations from median by sex of hh head */ egen median=median(rconspc), by (sexh) egen sd=sd(rconspc), by (sexh) *generate the ratio of the deviation from the median to the standard deviation gen ratio=abs((rconspc-median)/sd) (the slide then shows output summarizing sd, median, rconspc and ratio)
  • 94. Normality and Outlier *generate an outlier dummy equal to 1 if the value is more than 3 standard deviations from the median (ratio above 3) gen outlier=1 if ratio>3 & ratio~=. replace outlier=0 if outlier==. & ratio~=. tabulate outlier, missing table sexh outlier, contents(mean rconspc) row col missing
  • 95. Normality and Outlier • Listwise deletion histogram rconspc if outlier==0, normal • Data transformation – a log transformation gen lnrconspc=ln(rconspc) histogram lnrconspc if rconspc~=., normal • Imputation – First the analyst estimates a regression model in which the dependent variable has missing values – In the second step, the estimated regression coefficients are used to predict (impute) missing values of that variable
  • 96. Normality and Outlier * Replace outliers with missing replace rconspc=. if outlier==1 regress lnrconspc i.q1a i.sexh i.poor hhsize ageh, robust predict yhat replace lnrconspc=yhat if rconspc==. • Or xi: impute lnrconspc i.q1a i.sexh i.poor hhsize ageh, gen(imputed)
  • 97. Statistical Tests • compare – The compare command is an easy way to check if two variables are the same compare lnrconspc imputed • correlate command – The correlate command displays a matrix of Pearson correlations for the variable listed. correlate cons hhsize correlate cons hhsize, means pwcorr cons hhsize, sig
  • 98. Statistical Tests • ttest command – If, for example, we would like to test whether the mean of hhsize equals 6 using a single-sample t-test, the ttest command is used for this purpose. ttest hhsize == 6 • We can also test whether cons and food have the same mean ttest cons == food
  • 99. Statistical Tests • On the Side – How to interpret the P-values – Read the p-value from the results – Convert it to a percentage (100*p) – Now let X = 100*p – Decision rule • If X ≤ 1, reject Ho at the 1% level of significance • If 1 < X ≤ 5, reject Ho at the 5% level of significance • If 5 < X ≤ 10, reject Ho at the 10% level of significance
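Rather than reading the p-value off the screen, it can also be pulled from Stata's stored results; after ttest the two-sided p-value is saved in r(p). A small sketch of the decision rule above:
  ttest hhsize == 6
  display "p-value = " r(p) ",  X = 100*p = " 100*r(p)
  * e.g. reject Ho at the 5% level if X lies between 1 and 5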
  • 100. Statistical Tests • ttest command for independent groups with pooled (equal) variance ttest cons, by(sexh) • ttest command for independent groups using unequal variance ttest cons, by(sexh) unequal • hotelling command performs Hotelling's T-squared test of whether the means are equal between two groups. hotel cons, by(sexh)
  • 101. Linear Regression • Regression analysis involves estimating an equation that best describes the data • One variable is considered the dependent variable, while the others are considered independent (or explanatory) variables • Stata is capable of many types of regression analysis and associated statistical test • Here we touch on only a few of the more common commands and procedures
  • 102. • regress – This is an example of ordinary linear regression using the regress command. reg cons hhsize – This regression tells us that for every extra person (hhsize) added to a household, total monthly expenditure (cons) will increase by about 40 Ethiopian Birr – This increase is statistically significant, as indicated by the 0.000 probability associated with this coefficient Linear Regression
  • 103. – r-squared (r2) equals 0.0676. This value tells us that our independent variable (hhsize) accounts for approximately 7% of the variation of the dependent variable (cons) – Running a regression with robust standard errors corrects the standard errors when the residuals are not iid – This is very useful when there is heteroskedasticity – The robust option does not affect the estimates of the regression coefficients reg cons hhsize, robust Linear Regression
  • 104. – Stata stores results from estimation commands in e(), and you can see a list of what exactly is stored using the ereturn list command. ereturn list – Using the generate command, we can extract those results, such as estimated coefficients and standard errors, to be used in other Stata commands. • reg cons hhsize • gen intercept=_b[_cons] • display intercept • gen slope=_b[hhsize] • display slope Linear Regression
  • 105. – The estimates table command displays a table with coefficients and statistics for one or more estimation sets in parallel columns estimates store estimatename estimates table, b se t p – The predict command computes predicted value and residual for each observation predict pred – When using the resid option the predict command calculates the residual. predict e, residual Linear Regression
  • 106. – We can plot the predicted value and observed value using graph twoway command. regress cons food predict pred graph twoway (scatter cons food) (line pred food) – The rvfplot command generates a plot of the residual versus the fitted values. It is used after regress command. regress cons food rvfplot – The rvpplot command produces a plot of the residual versus a specified predictor rvpplot food Linear Regression
  • 107. • Hypothesis tests – The test command performs Wald tests for simple and composite linear hypotheses about the parameters of estimation recode q1a 7/9=7 gen reg1=q1a==1 gen reg3=q1a==3 gen reg4=q1a==4 gen reg7=q1a==7 regress cons hhsize reg1 reg3 reg4 reg7 Linear Regression
  • 108. test reg3=0 test reg3 reg4 reg7 – The test command tests the hypothesis that the region 3 coefficient is zero (test reg3=0) and that all the region coefficients (reg3, reg4 and reg7) are jointly zero, finding that the probability is very low (less than 0.001), so we can reject this hypothesis. – If you want to test the joint significance of a set of related variables, you can use testparm reg* test of hypothesis that all region dummies are zero Linear Regression
  • 109. • Ramsey RESET to test for omitted variables (misspecification) ovtest [, rhs] – This test amounts to estimating y = xb+zt+u and then testing t=0 regress cons hhsize reg3 reg4 reg7 ovtest tests significance of powers of predicted cons ovtest, rhs tests significance of powers of hhsize, reg3, reg4 and reg7 Linear Regression
  • 110. • Example: ovtest Ramsey RESET test using powers of the fitted values of cons Ho: model has no omitted variables F(3, 1441) = 4.47 Prob > F = 0.0039 – The ovtest rejects the hypothesis that there are no omitted variables, indicating that we need to improve the specification Linear Regression
  • 111. • Heteroskedasticity – We can use the hettest command to run an auxiliary regression of the squared residuals on the fitted values. hettest Ho: Constant variance Variables: fitted values of cons chi2(1) = 81.50 Prob > chi2 = 0.0000 – The hettest indicates that there is heteroskedasticity, which needs to be dealt with Linear Regression
  • 112. • We can also use the information matrix test via the imtest command, which provides a summary test of violations of the assumptions on the regression errors. imtest • The imtest also confirms the existence of heteroskedasticity, skewness and kurtosis problems Linear Regression
  • 113. – The xi prefix is used to dummy code categorical variables, and we tag these variables with an “i.” in front of each target variable xi: regress cons hhsize i.q1a, robust – By default, Stata selects the first category in the categorical variable as the reference category. If we would like to declare a certain category as reference category char q1a[omit] 7 xi:regress cons hhsize i.q1a, robust Linear Regression
  • 114. – Logistic regression logistic poor hhsize ageh sexh, coef xi:logit poor hhsize ageh sexh i.q1b ereturn list estat summarize estat ic mfx [, options] – Options dydx is the default. eyex specifies that elasticities be calculated in the form of d(lny)/d(lnx) dyex specifies that elasticities be calculated in the form of d(y)/d(lnx) eydx specifies that elasticities be calculated in the form of d(lny)/d(x) Linear Regression
  • 115. Data Management • We can subset data by keeping or dropping variables, or by keeping and dropping observations (see the sketch after this list) – keep and drop variables • The keep command keeps the variables in the list while dropping all other variables • The drop command deletes the variables in the list while keeping all other variables – keep and drop observations • The keep if command keeps observations for which the condition is met, and drop if deletes them
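A minimal sketch of these four forms (the variable list is only an example):
  use ERHScons1999, clear
  keep hhid q1a sexh hhsize cons food   // keep only these variables
  drop food                             // delete one variable
  keep if q1a==1                        // keep only region 1 observations
  drop if cons==.                       // delete observations with missing consumption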
  • 116. Data Management • sort – The sort command arranges the observations of the current data into ascending order based on the values of the variables listed • Variable ordering – The order command helps us organize variables in a way that makes sense by changing the order of the variables • Under the by prefix, _N is the total number of observations within each group, and _n is a running counter that uniquely identifies observations within the group (see the sketch below)
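A hedged sketch of sort, order, and the _n/_N system variables under the by prefix:
  sort q1a hhid                       // ascending sort by region, then household id
  order hhid q1a sexh                 // move these variables to the front of the dataset
  bysort q1a: gen region_obs = _N     // number of observations in each region
  bysort q1a: gen obs_no = _n         // running counter within each region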
  • 117. Data Management • Often we don’t have all the info that we need in one dataset, and we have to merge them into one (since STATA allows for only one dataset in memory). • There are several types of “merging” datasets…
  • 118. Data Management • As long as the variables in the files are the same and the only thing you need to do is to add observations, this is vertical combination. • For this we use the append command. • Since this is used less often, I will skip it, but you can look at it in the help file.
  • 119. Data Management • Appending data files – concatenates two datasets, that is, sticks them together vertically, one after another use tigray.dta, clear append using amhara.dta – The append command does not require that the two datasets contain the same variables, but it is highly recommended to use an identical list of variables with the append command to avoid missing values from one dataset
  • 120. Data Management • If the identifying variable which appears in the files is unique in both files, then it's a one-to-one match. Unique means that for each value of this variable, there is only one observation that contains it. In the figure below, country is the identifying variable. In both datasets, each country has only one observation.
  • 121. Data Management • One-to-one match merging • The merge command sticks two datasets horizontally, one next to the other. Both datasets must contain the identifying (merge) variable; with the older merge syntax they also had to be sorted by it, while merge 1:1 sorts them for you use hh_characters.dta, clear merge 1:1 hhid using consum.dta
  • 122. Data Management • One-to-many matching – If the identifying variable is unique in one file, but not unique in the other, then it's a one-to-many matching.
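A hedged sketch of such a merge, assuming a hypothetical village-level file village_prices.dta in which the village identifier q1b is unique; with the household file in memory this is written as m:1 (the mirror image, merge 1:m, applies when the village file is the one in memory):
  use hh_characters.dta, clear              // many households per village
  merge m:1 q1b using village_prices.dta    // hypothetical village-level file
  tab _merge                                // check how the observations matched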
  • 123. Data Management • Collapse – Sometimes we have data files that need to be aggregated at a higher level to be useful for us. For example, we have household data but we are really interested in regional data. The collapse command serves this purpose by converting the dataset in memory into a dataset of means, sums, medians and percentiles • For instance, we would like to see the mean cons for each q1a and sex of hh head. collapse (mean) cons, by(q1a sex)
  • 124. Data Management • The reshape wide command tells Stata that we want to go from long to wide after collapsing. The i() option specifies the row (identifier) variable while j() specifies the column variable reshape wide cons, i(q1a) j(sexh)
  • 125. Importing Data • The insheet command can import data in text format (Tab delimited, or comma separated values CSV files). • Syntax: insheet [variable names] using <filename> [,options] • Options: – tab : tab-delimited data – comma : comma-delimited data – delimiter("char"): use char as delimiter – clear: replace data in memory – names : variable names are included on the first line of the file • Example cd “…Datafor stata training manual_EEA" clear insheet using ERHS_SPSS.csv, comma
  • 126. Good Sites to Look At! • STATA HELP – either online or in the software itself. • http://stataproject.blogspot.com • http://www.stata.com/ • http://www.stata.com/statalist/ • http://ideas.repec.org/s/boc/bocode.html • http://www.princeton.edu/~erp/stata/main.html • http://www.cpc.unc.edu/services/computer/presentations/statatutorial/ • http://www.ats.ucla.edu/stat/stata/
  • 127. Good Sites to Look At! • Statalist is hosted at the Harvard School of Public Health, and is an email listserver where Stata users, ranging from experts who write Stata programs to users like us, maintain a lively dialogue about all things statistical and Stata. You can sign on to Statalist so that you can receive as well as post your own questions through email.