3. Lecture Outline
• What is SPSS
• Uses of SPSS
• Preparing to enter data
• Preparing a data dictionary
• Data Structures
• Errors in data
• Data Cleaning
5. Uses of SPSS
• Data entry
• Data cleaning and editing
• Data analysis
• Data presentation
• Data Importing and Exporting
• SPSS Data Library
6. Preparing a Data Dictionary
• What is a data dictionary?
– A book/document containing all variables and
the codes/categories assigned to them
– Also contains how the variables will be
entered and other remarks necessary
– Specifies width/ length of variables
– Specifies how missing values will be
assigned
7. Data coding
• Translation of responses on the
questionnaires or data collection sheets to
specific categories for the purpose of
analysis.
• Assignment of numbers to the various
levels of the variables.
• The workload is light for pre-coded
questionnaires
9. • Need to assign numerical codes to
categorical data before entering
• For example, you may choose to assign
codes of 1, 2, 3 and 4 to categories of “no
pain”, “mild pain”, “moderate pain” and
“severe pain” respectively
10. • These codes can be put in the
questionnaire when collecting the data.
• For binary data e.g. yes/no answers, it is
often convenient to assign codes 1 (e.g.
for yes) and 0 or 2 (for no).
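The coding step described above can be sketched in Python. This is only an illustration outside SPSS, assuming the pain categories and yes/no codes from the slides; the names PAIN_CODES, YES_NO_CODES and encode are hypothetical.

```python
# Code books: map each response category to its numeric code.
# PAIN_CODES, YES_NO_CODES and encode are hypothetical names for this sketch.
PAIN_CODES = {
    "no pain": 1,
    "mild pain": 2,
    "moderate pain": 3,
    "severe pain": 4,
}
YES_NO_CODES = {"yes": 1, "no": 0}  # coding "no" as 2 is an equally common choice


def encode(response, codes):
    """Return the numeric code for a response, ignoring case and outer spaces."""
    key = response.strip().lower()
    if key not in codes:
        raise ValueError(f"no code defined for response {response!r}")
    return codes[key]


coded = [encode(r, PAIN_CODES) for r in ["No pain", "severe pain", "Mild pain"]]
print(coded)  # [1, 4, 2]
```

Raising an error for an unknown response, rather than guessing a code, keeps uncoded answers visible so they can be checked against the questionnaire.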
11. Need for Coding Guide/Data Dictionary
• Prepare data in format to allow use of
computers for statistical analysis.
• Prepare code book or data dictionary for
the questionnaire.
• Specify range of values expected.
12. • Unit of measurement should be consistent
for all observations on a variable. E.g.
weight should be recorded in kg or in
pounds, but not both interchangeably
• Time: in days or in hours?
– For example, length of hospital stay
13. Example of data dictionary
Variable name      Variable label   Value labels        Width   Remark
1. Age             AGE              ---                 2       Missing=99
2. Sex             SEX              1=male, 2=female    1       ---
3. Do you smoke?   SMOKE            1=YES, 2=NO         1       Missing=9
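As a rough illustration, the example dictionary can also be held as a data structure and used to check records as they are entered. The sketch below is Python rather than SPSS, and it assumes that the upper-case entries (AGE, SEX, SMOKE) are the variable names; the layout of DATA_DICTIONARY and the function check_record are hypothetical.

```python
# The example data dictionary held as a Python dict.
# Assumption: AGE, SEX and SMOKE are the variable names used at data entry.
DATA_DICTIONARY = {
    "AGE": {"values": None, "width": 2, "missing": 99},
    "SEX": {"values": {1: "male", 2: "female"}, "width": 1, "missing": None},
    "SMOKE": {"values": {1: "yes", 2: "no"}, "width": 1, "missing": 9},
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"{name}: not entered")
            continue
        value = record[name]
        if value == spec["missing"]:
            continue  # the declared missing code is always acceptable
        if len(str(value)) > spec["width"]:
            problems.append(f"{name}: {value} is wider than {spec['width']} digits")
        if spec["values"] is not None and value not in spec["values"]:
            problems.append(f"{name}: {value} is not a defined code")
    return problems


print(check_record({"AGE": 23, "SEX": 3, "SMOKE": 9}))
# ['SEX: 3 is not a defined code']
```

Here SMOKE=9 passes because 9 is the declared missing code, while SEX=3 is reported because no such category exists.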
14. Example
• Topic: Smoking among medical students
• 200 questionnaires/records
• 6 questions/variables
15. Variables in questionnaire
• Serial Number
• Age
• Place of residence
1=on campus, 2=off campus
• Sex
1=male, 2=female
• Do you smoke?
1=yes, 2=no
18. SPSS windows
• Data editor
– For data entry
– For statistical analysis
• Viewer
– Results are displayed
19. Data Editor
• Two views
– Data view: for data entry
– Variable view: to define variable
characteristics
20. Preparing Data Structures in
SPSS
• Variable views
– Variable names
– Variable types
– Value labels
– Variable width
– Column
– Measure
21. Data Entry
• Use of computer packages such as SPSS
– Improves the accuracy and speed of data analysis
– Makes it easy to check for errors, produce graphical
summaries and generate new variables
• Log in data as it arrives
• Back up frequently
22. • Problems with dates and times:
– dates and times should be entered in a
consistent manner, e.g. as day/month/year or
month/day/year but not interchangeably.
– It is important to find out what format the
statistical package can read
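A small Python sketch of why one agreed date format matters. It assumes, purely for illustration, that all dates are typed as day/month/year; DATE_FORMAT and parse_date are hypothetical names.

```python
from datetime import datetime

# Assumed convention for this sketch: every date is typed as day/month/year.
DATE_FORMAT = "%d/%m/%Y"


def parse_date(text, fmt=DATE_FORMAT):
    """Parse a date typed in the agreed format; return None if it does not fit."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return None


print(parse_date("03/04/2021"))  # 2021-04-03, i.e. 3 April, not 4 March
print(parse_date("04/13/2021"))  # None: read as day/month/year there is no month 13
```

The first value would silently mean a different day under month/day/year, which is why mixing the two conventions cannot always be detected afterwards.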
23. Handling missing data
• Consider what to do with missing values before data is
entered.
• In most cases, need to use some symbol to represent a
missing value
• Statistical packages deal with missing values in different
ways
• Some use special characters (e.g. a full stop or asterisk)
to indicate missing values, whereas others require you to
define your own code for a missing value (commonly
used values are 9, 99 or 999)
• The value that is chosen should be one that is not
possible for that variable
24. • For example when entering a categorical
variable with four categories (coded 1,2,3,
and 4), you may choose the value 9 to
represent missing values.
• However, if the variable is age of a child,
then a different code should be chosen.
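The choice of missing-value code can be illustrated with a short Python sketch. The variable names PAIN and CHILD_AGE and the code 99 for a child's age are assumptions made for this example; the slides only say that a code other than 9 is needed.

```python
# Missing-value codes per variable; names and codes are assumptions.
MISSING_CODES = {
    "PAIN": 9,        # valid codes are 1 to 4, so 9 cannot be a real answer
    "CHILD_AGE": 99,  # 9 could be a real age for a child, 99 could not
}


def recode_missing(record, missing_codes=MISSING_CODES):
    """Return a copy of the record with missing codes replaced by None."""
    return {
        name: None if value == missing_codes.get(name) else value
        for name, value in record.items()
    }


print(recode_missing({"PAIN": 9, "CHILD_AGE": 9}))
# {'PAIN': None, 'CHILD_AGE': 9}
```

The same entered value, 9, is treated as missing for the pain score but kept as a genuine age of nine years.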
25. • If a large proportion of data is missing, then the results
are likely to be unreliable
• Reasons why data are missing should always be
investigated: how much is missing and why?
• If missing data tend to cluster around a particular
variable, or in a particular subgroup of individuals, then
it may indicate that the variable is not applicable or has
not been measured for that group of individuals
26. • In that case, the group of individuals should be
excluded from any analysis of that
variable
• Or it may be that the data are simply sitting
on a piece of paper in someone’s drawer
and are yet to be entered!
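One way to see how much is missing, and whether it clusters in a subgroup, is to count missing values by group. The Python sketch below uses made-up records and a hypothetical variable CIGS_PER_DAY that would not apply to non-smokers (SMOKE=2); missing values are assumed to be stored as None.

```python
from collections import Counter


def missing_by_group(records, variable, group_by):
    """Count records where `variable` is None, split by the `group_by` value."""
    counts = Counter()
    for record in records:
        if record.get(variable) is None:
            counts[record.get(group_by)] += 1
    return dict(counts)


# Made-up records for illustration only.
records = [
    {"SMOKE": 1, "CIGS_PER_DAY": 10},
    {"SMOKE": 2, "CIGS_PER_DAY": None},
    {"SMOKE": 2, "CIGS_PER_DAY": None},
    {"SMOKE": 1, "CIGS_PER_DAY": None},
]
print(missing_by_group(records, "CIGS_PER_DAY", "SMOKE"))
# {2: 2, 1: 1}
```

In this toy data every non-smoker is missing the value, which suggests "not applicable", whereas the single missing value among smokers is worth tracing back to the questionnaire.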
27. Errors in data
• In any study, there is always the potential
for errors to occur in a data set, either at
the outset when taking the measurements,
or when collecting, and entering data onto
a computer
• It is hard to eliminate all of these errors
• But one can reduce the number of typing
errors by checking the data carefully once
they have been entered.
28. Common sources of error
• “not applicable” or “blank” coded as “0”
• Typing errors on data entry: 18 instead
of 81
• Column shift: data for one variable
entered under the adjacent column
• Coding errors
• Loss of concentration
30. Detecting errors
- Check for completeness and
correctness of records.
- Indicate admissible values during
data entry
- Range checks: permissible responses.
- Statistical editing
31. How to Detect Errors via
Statistical editing
• Produce descriptive statistics for all
variables.
• Check the frequency distribution of each
variable, e.g. for sex coded 1=male, 2=female,
a value of 3 must be an error
• If the standard deviation is higher than the mean,
check for outlying observations
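The two checks above can be sketched in Python with the standard library. The data are made up, and frequency_table and sd_exceeds_mean are hypothetical helper names; the standard deviation rule is only a rough warning sign for non-negative measurements such as age.

```python
from collections import Counter
from statistics import mean, stdev


def frequency_table(values):
    """Return how often each entered code occurs."""
    return dict(Counter(values))


def sd_exceeds_mean(values):
    """Warning sign for non-negative data: sample SD larger than the mean."""
    return stdev(values) > mean(values)


# Made-up entries: sex should only contain the codes 1 and 2.
sex = [1, 2, 2, 1, 3, 2]
print(frequency_table(sex))  # {1: 2, 2: 3, 3: 1}

# Made-up ages with one likely typing error (210 instead of 21).
ages = [21, 22, 23, 22, 210]
print(sd_exceeds_mean(ages))  # True
```

The stray code 3 and the inflated standard deviation both point to records that should be checked against the original forms.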
32. Quality Control
- Record verification (double entry)
– Does not rule out the possibility that the same error
has been made on both occasions
– Disadvantage: it takes twice as long to enter the data,
which may have major cost or time implications
- Creating check files
- Random checking: select records at random, but the
sample should represent all forms being entered
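Double entry can be illustrated by comparing the two entered versions record by record. This Python sketch assumes both passes are lists of records in the same order; compare_entries is a hypothetical helper and the data are made up.

```python
def compare_entries(first, second):
    """Return (record_index, variable, first_value, second_value) for each mismatch."""
    if len(first) != len(second):
        raise ValueError("the two entries contain different numbers of records")
    mismatches = []
    for index, (a, b) in enumerate(zip(first, second)):
        for name in sorted(set(a) | set(b)):
            if a.get(name) != b.get(name):
                mismatches.append((index, name, a.get(name), b.get(name)))
    return mismatches


# Made-up example: the first record's age was typed as 18 once and 81 once.
first_pass = [{"AGE": 18, "SEX": 1}, {"AGE": 25, "SEX": 2}]
second_pass = [{"AGE": 81, "SEX": 1}, {"AGE": 25, "SEX": 2}]
print(compare_entries(first_pass, second_pass))
# [(0, 'AGE', 18, 81)]
```

A mismatch only shows that the two entries disagree; the original form is still needed to decide which value is right, and identical errors in both passes go unnoticed.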
33. Error checking
• Categorical data: relatively easy, since values that are not
allowable must be errors
• Check the frequency distribution of each variable, e.g. for sex
coded 1=male, 2=female, a value of 3 must be an error
• Numerical data: produce descriptive statistics for all
variables
• If the standard deviation is higher than the mean, check for
outlying observations
• Range checks: upper and lower limits can be specified for
each variable
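A range check can be sketched in Python as follows. The limits in LIMITS are assumptions chosen for a survey of students, not values from the lecture, and out_of_range is a hypothetical helper.

```python
# Lower and upper limits per variable; these limits are assumptions.
LIMITS = {"AGE": (16, 60), "SEX": (1, 2), "SMOKE": (1, 2)}


def out_of_range(records, limits=LIMITS):
    """Return (record_index, variable, value) for every value outside its limits."""
    flagged = []
    for index, record in enumerate(records):
        for name, (low, high) in limits.items():
            value = record.get(name)
            if value is not None and not (low <= value <= high):
                flagged.append((index, name, value))
    return flagged


# Made-up records for illustration only.
records = [
    {"AGE": 22, "SEX": 1, "SMOKE": 2},
    {"AGE": 210, "SEX": 3, "SMOKE": 1},
]
print(out_of_range(records))
# [(1, 'AGE', 210), (1, 'SEX', 3)]
```

The function only flags values for investigation; it does not change them.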
34. • Dates: accuracy is not easy to check, but some
values must be incorrect, for example 30th Feb., any
day of the month greater than 31, or any month
greater than 12
• Apply logical checks:
– date of birth should correspond to the patient’s age
– subjects should usually have been born before
entering the study (at least in most studies)
– patients who have died should not appear at
subsequent follow-up visits
– there should be no pregnant men
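Some of these logical checks can be sketched in Python. The field names (date_of_birth, date_enrolled, age, sex, pregnant) are hypothetical, sex is assumed to be coded 1=male, and age is assumed to be recorded in completed years on the enrolment date.

```python
from datetime import date


def logical_problems(record):
    """Return a list of logical inconsistencies found in one record."""
    problems = []
    born = record["date_of_birth"]
    enrolled = record["date_enrolled"]

    if born > enrolled:
        problems.append("born after entering the study")
    else:
        # Age in completed years on the enrolment date.
        had_birthday = (enrolled.month, enrolled.day) >= (born.month, born.day)
        expected_age = enrolled.year - born.year - (0 if had_birthday else 1)
        if record["age"] != expected_age:
            problems.append(
                f"age {record['age']} does not match date of birth "
                f"(expected {expected_age})"
            )

    if record["sex"] == 1 and record.get("pregnant"):  # assumed coding: 1=male
        problems.append("male recorded as pregnant")
    return problems


# Made-up record for illustration only.
record = {
    "date_of_birth": date(2000, 5, 20),
    "date_enrolled": date(2021, 3, 1),
    "age": 25,
    "sex": 1,
    "pregnant": True,
}
print(logical_problems(record))
# ['age 25 does not match date of birth (expected 20)', 'male recorded as pregnant']
```

Impossible dates are caught even earlier in this setting: constructing date(2021, 2, 30) raises a ValueError, so a 30th of February cannot be stored as a date at all.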
35. • With all error checks, a value should only
be corrected if there is evidence that a
mistake has been made
• Do not change values simply because
they look unusual; investigate them first