Tips on Setting Up Excel Spreadsheets - Nancy Buderer
TIPS FOR SETTING-UP AN EXCEL SPREADSHEET FOR RESEARCH PROJECTS
Nancy Buderer, MS
Consulting
Biostatistician, Program Evaluator, Research Consultant
nancy@budererdrug.com
ABSTRACT
Objective
In investigator-initiated research studies, the investigator is often responsible for entering the data
into a database or spreadsheet before sending it to a statistician for analysis. This is common in
teaching hospitals, particularly for medical resident and student research. Typically, Microsoft Excel is
chosen because it is readily available and easy to use.
This document assists investigators in designing a spreadsheet for their research data that can readily be
exported from spreadsheet software (e.g., Microsoft Excel) into a statistical analysis software package
(e.g., SAS or SPSS).
Methods
Best practices in spreadsheet design for research are provided along with examples in Microsoft Excel.
Common pitfalls for data entry are described.
Results
Attention to detail is critical not only when collecting research data, but also when entering it into a
spreadsheet. The following are basic concepts for designing a spreadsheet:
1. One row of data per subject
2. One column for each variable
3. First column is a unique identifier
4. Column labels follow SAS or SPSS naming conventions
5. Columns formatted according to their data type (numeric, mm/dd/yyyy, military time)
6. Data entered as numbers, not text
7. Coding system with documentation
When possible, investigators should show their spreadsheet to their statistician before entering all of
their data. This is valuable in many respects: the analyst can spot inconsistencies between the
spreadsheet and the data collection tool, identify troublesome fields, and ensure that the outcomes the
investigator intended to measure are captured on the spreadsheet in such a way as to allow for the
appropriate statistical analysis.
Conclusion
A well-designed spreadsheet for entering research data may improve the accuracy of the data entered
and save time in the data analysis phase.
1 | Nancy Buderer, MS nancy@budererdrug.com
rev. 3-1-2016
TIPS
Tip 1. Each study subject has one ROW of data.
It is simplest to keep everything for each research subject on one row, even if a subject has data at
multiple time points.
Below is an example where subjects have their blood pressure taken at two different times.
Subject #1’s first blood pressure was 120 over 80. His second blood pressure was 115 over 80.
The column labeled SBP_1 is the subject’s systolic blood pressure at time 1 (120); the column labeled
DBP_1 is his diastolic blood pressure at time 1 (80); the column labeled SBP_2 is the subject’s systolic
blood pressure at time 2 (115); and DBP_2 is his diastolic blood pressure at time 2. All of subject #1’s
data is on one row of data.
ID SBP_1 DBP_1 SBP_2 DBP_2
1 120 80 115 80
2 130 80 120 75
In the spreadsheet below, each subject has a row for time 1 and another row for time 2, but
there is no way to distinguish between time points. This layout is much harder to analyze.
ID SBP DBP
1 120 80
1 115 80
2 130 80
2 120 75
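If data have already been entered with one row per visit, they can be reshaped to one row per subject. The following is a minimal sketch in Python, not part of the original handout. It assumes that each subject's rows are listed in time order, which the layout above cannot itself guarantee. The function name `long_to_wide` is hypothetical.

```python
from collections import defaultdict

def long_to_wide(rows):
    """Collapse (ID, SBP, DBP) rows into one dict per subject.

    Assumption: rows for a subject are listed in time order, because the
    one-row-per-visit layout carries no time-point column.
    """
    visits = defaultdict(list)
    for subject_id, sbp, dbp in rows:
        visits[subject_id].append((sbp, dbp))

    wide = []
    for subject_id in sorted(visits):
        record = {"ID": subject_id}
        # Number the visits 1, 2, ... and build SBP_1, DBP_1, SBP_2, ...
        for t, (sbp, dbp) in enumerate(visits[subject_id], start=1):
            record[f"SBP_{t}"] = sbp
            record[f"DBP_{t}"] = dbp
        wide.append(record)
    return wide

long_rows = [(1, 120, 80), (1, 115, 80), (2, 130, 80), (2, 120, 75)]
wide_rows = long_to_wide(long_rows)
for record in wide_rows:
    print(record)
```

If the row order is not known to reflect time, the time point cannot be recovered and has to be re-entered from the source documents.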
SYMBOLS
Thumbs up – Do it like this
Thumbs down – Don’t do this
Tip 2. Variables that are measured on the subjects are represented in
COLUMNS. Use one column for each variable.
For variables that only have one value (like age, gender, race) this is straightforward.
If a particular variable has a “check all that apply” kind of response, then use one column for each
possible response choice. Enter the number 1 if the subject chose that response and enter a 0 if they
did not check that response.
For example, a survey asks:
“What kinds of exercise have you done in the last week (check all that apply)”?
Run
Walk
Bike
Swim
Subject #1 did all of them. Enter 1 for RUN, 1 for WALK, 1 for BIKE, and 1 for SWIM. Subject #3 didn’t
do any of these exercises, so enter 0 in each of the columns.
ID RUN WALK BIKE SWIM
1 1 1 1 1
2 1 0 1 0
3 0 0 0 0
The examples below try to put all the response choices in one column. To the computer, each
entry just looks like a string of characters.
ID exercise
1 1,2,3,4
2 1,3
3 none
ID exercise
1 run, walk, bike, swim
2 run, bike
3 none
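Answers that were already typed into a single column can be split into one 0/1 column per choice. The following is a minimal sketch in Python, assuming the answers were typed as comma-separated words. The helper name `to_indicators` is hypothetical.

```python
CHOICES = ["RUN", "WALK", "BIKE", "SWIM"]

def to_indicators(response):
    """Turn a free-text answer such as 'run, bike' into 0/1 columns."""
    chosen = {part.strip().upper() for part in response.split(",")}
    return {choice: int(choice in chosen) for choice in CHOICES}

answers = {1: "run, walk, bike, swim", 2: "run, bike", 3: "none"}
indicator_rows = [{"ID": sid, **to_indicators(text)} for sid, text in answers.items()]
for row in indicator_rows:
    print(row)
```

Note that this sketch silently ignores any word that is not one of the four choices (including misspellings), so the result should still be checked against the original forms.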
Tip 3. The first column should be a unique identifier for the subject.
Name, medical record number, social security number, etc., are convenient and unique identifiers, but
they do not protect the subject’s privacy.
Develop an arbitrary ID system.
If it is necessary to keep a subject’s name or other private identifier, then keep a separate list that links
name to the arbitrary ID number.
Ideally, the ID should be a number, not a series of characters.
In the example below, subjects 1 and 2 were enrolled in 2011 and subjects 3 and 4 were
enrolled in 2012. Here are 3 ways to incorporate year into the ID.
Option 1        Option 2    Option 3
ID    Year      ID          ID
1     2011      1001        1.2011
2     2011      1002        2.2011
3     2012      2001        1.2012
4     2012      2002        2.2012
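The arbitrary ID system and the separate linking list described above can be sketched as follows. This is a minimal illustration in Python with made-up names. The helper `assign_ids` is hypothetical, and simple sequential numbering is only one possible scheme.

```python
def assign_ids(names, start=1):
    """Return (linking_key, research_ids).

    linking_key maps each private name to an arbitrary numeric ID and must
    be stored separately from the research spreadsheet.
    """
    linking_key = {}
    for name in names:
        if name not in linking_key:          # one ID per subject
            linking_key[name] = start + len(linking_key)
    return linking_key, sorted(linking_key.values())

# Made-up names, for illustration only.
key, ids = assign_ids(["Subject A", "Subject B", "Subject A", "Subject C"])
print(key)   # kept in a separate list, apart from the research data
print(ids)   # only these numbers go into the research spreadsheet
```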
Do not use private identifiers in spreadsheets that are designed for research purposes.
• name
• medical record number
• date of birth
• social security number
Protect the subject’s privacy throughout the spreadsheet. It is rarely necessary for the statistician to
receive identifying information.
• Enter age in years rather than the actual date of birth.
• If the actual date of an event is not necessary, then consider entering only the minimum necessary
information (e.g., month and year).
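Age in years can be computed once from the date of birth, so that only the age is entered into the research spreadsheet. The following is a minimal sketch in Python with made-up dates. The function name `age_in_years` is hypothetical.

```python
from datetime import date

def age_in_years(birth_date, reference_date):
    """Whole years completed between birth_date and reference_date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)

# Made-up dates, for illustration only.
age_before_birthday = age_in_years(date(1980, 6, 15), date(2012, 5, 1))
age_on_birthday = age_in_years(date(1980, 6, 15), date(2012, 6, 15))
print(age_before_birthday, age_on_birthday)
```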
Tip 4. Use column labels that follow standard SAS or SPSS variable naming
rules.
Typically a statistician will import an Excel spreadsheet into a statistical analysis software program like
SAS or SPSS. By using column labels that already follow standard SAS or SPSS naming conventions, the
researcher can save the statistician time (and money). [The words “column label” and “variable names”
are used interchangeably.]
Column Label Guidelines
• Unique name for each column; no two columns with the same name
• Up to 32 characters in length (if using older versions of SPSS or SAS, keep this to 8 characters)
• Starts with a letter
• May be a combination of letters and numbers
• No spaces
• Underscore is OK, but other special characters are not
• Upper or lower case; they’re treated the same
Examples: Subject_ID, Age_T1, med_costs, date_first_use
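The guidelines above can be checked automatically before the spreadsheet is sent to the statistician. The following is a minimal sketch in Python. It encodes only the guidelines listed in this tip, not the complete naming rules of SAS or SPSS, and the function name `check_labels` is hypothetical.

```python
import re

# Starts with a letter, then letters, digits, or underscores; at most 32 characters.
LABEL_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,31}")

def check_labels(labels):
    """Return a list of (label, problem) pairs; an empty list means all labels pass."""
    problems = []
    seen = set()
    for label in labels:
        if not LABEL_PATTERN.fullmatch(label):
            problems.append((label, "bad first character, invalid character, or too long"))
        if label.lower() in seen:            # case is ignored, so Age and AGE clash
            problems.append((label, "duplicate name"))
        seen.add(label.lower())
    return problems

good = check_labels(["Subject_ID", "Age_T1", "med_costs", "date_first_use"])
bad = check_labels(["1st_visit", "med costs", "Age", "AGE"])
print(good)
print(bad)
```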
If you want to have more description in the column labels so that the data entry person has a more
complete description of the column, add it as the first row and keep the column labels as the second
row. When the statistician does the analysis, this first row can be easily deleted leaving the clean set of
column labels. (But don’t merge cells in this first upper row. Keep column widths the same throughout
the column.)
      age at time 1   Medication costs   1st day used medication
ID    Age_T1          med_costs          date_first_use
If the same variable is measured at multiple time points, like age at time 1 and again at time 2, then
keep the first part of the column label the same and change it by adding an underscore and the number
behind it to indicate the time point (e.g., AGE_T1, AGE_T2).
(Callout on the example above: the first row can be easily deleted before analysis, leaving the simple column labels.)
Tip 5. Format columns according to their respective data type.
If a cell in Excel is not specifically formatted, it defaults to “GENERAL”. This is the most dangerous
format for researchers because just about anything can be typed into the cell. SAS or SPSS may then
read those values as strings of characters, which can lead to meaningless data.
Take the time to properly format columns according to their respective data type. Use the FORMAT
CELLS menu to specify the data type of each column of data: numbers are numeric, dates are
mm/dd/yyyy, etc. In Excel, do this by right-clicking the column (it is highlighted and a menu pops
up); choose “Format Cells…”; choose the “Number” tab; and select from the list of formats. Another
way to get to the FORMAT menu is from the FORMAT icon on the toolbar.
(screen copied from Microsoft Excel, Microsoft Corporation, Microsoft Office 2007)
Formats preferred for research data are as follows:
• Number
Use this for any type of data that is numeric - continuous, interval, ordered, categorical if a
number can be assigned to the category (e.g., 1=yes, 0=no). Excel’s default is 2 decimal places,
but it can be changed as needed.
• Date
The simplest date format is mm/dd/yyyy. On the menu the example to choose reads
“*3/14/2001”. When dates are entered into this column, enter using the slash marks as shown
below. Notice that leading zeros on January through September are automatically dropped.
date_first_use
5/1/2012
5/2/2012
10/2/2011
• Text
If a data type has to be letters, then specify it as such and limit the column width to the
maximum number of characters anticipated. Common uses for text fields are fill-in-the-blank
responses in surveys. In the example from tip 2, “Other” might be a choice. A text column
(other_text) allows for a fill-in-the-blank response.
Run
Walk
Bike
Swim
Other _____________________________________
ID run walk bike swim other other_text
1 1 1 1 1 1 yoga
2 1 0 1 0 1 karate
3 0 0 0 0 1 aerobics
• Time
Choose military time - it avoids having to distinguish between AM and PM. The example on the
menu is “13:30”.
GENERAL format is not desirable for spreadsheets that are intended to be imported into a
statistical package.
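Entries that do not fit the intended data type can be listed before the file is sent for analysis. The following is a minimal sketch in Python, assuming the column values have been read in as text (for example, from a CSV export). The names `CHECKS` and `bad_values` are hypothetical.

```python
from datetime import datetime

def is_number(value):
    """True if the text can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def matches(value, fmt):
    """True if value parses with the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

CHECKS = {
    "number": is_number,
    "date": lambda v: matches(v, "%m/%d/%Y"),   # mm/dd/yyyy
    "time": lambda v: matches(v, "%H:%M"),      # 24-hour (military) time
}

def bad_values(values, kind):
    """Return the entries that do not fit the expected data type."""
    return [v for v in values if not CHECKS[kind](v)]

print(bad_values(["5/1/2012", "5/2/2012", "May 3, 2012"], "date"))
print(bad_values(["13:30", "1:30 PM"], "time"))
print(bad_values(["120", "115", "12O"], "number"))   # last entry has a letter O
```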
Setting NUMBER formats
(screens copied from Microsoft Excel, Microsoft Corporation, Microsoft Office 2007)
Setting DATE format
Tip 6. Data should be entered as NUMBERS wherever possible, not text, not
special characters.
Obviously, data that are numeric in nature (e.g., age, pain scores) will be entered as numbers. The
problem comes when data are categorical (e.g., yes/no, gender, race). Letters are problematic in
analyzing research data. Letters are case-sensitive (e.g., ‘N’ is different from ‘n’). Differently spelled
words are interpreted as different categories even though they mean the same thing (e.g., ‘Yes’ is
different from ‘YES’ and ‘Y’). Numbers are typically faster to enter and prone to fewer errors. For
categorical data, it is recommended that code numbers be entered rather than words or letters.
In the example below, use 1 for yes and 0 for no; use 1 for male and 2 for female; use 1 for
Black and 2 for Caucasian.
ID HAS_LIVING_WILL GENDER RACE
1 1 1 1
2 0 1 1
3 1 2 2
4 0 2 1
5 0 1 2
(Note: Using 1 for YES (or present) and 0 for NO (absent) is best for some data analyses, such as logistic
regression.)
ID HAS_LIVING_WILL GENDER RACE
1 Yes M Black
2 No M Black
3 yes F C
4 None f B
5 N m Caucasian
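The GENDER column in the text-based example shows why letters cause trouble: to the computer, each distinct spelling is its own category. The following is a minimal sketch in Python. The recode table `GENDER_CODES` and the helper `recode` are hypothetical, and entries that cannot be recoded are flagged for manual review rather than guessed.

```python
from collections import Counter

raw_gender = ["M", "M", "F", "f", "m"]

# To the computer these are four distinct categories, not two.
print(Counter(raw_gender))

# Hypothetical recode table; extend it to cover the spellings actually used.
GENDER_CODES = {"m": 1, "male": 1, "f": 2, "female": 2}

def recode(value, codes):
    """Map a text entry to its code number; None flags an entry to review by hand."""
    return codes.get(value.strip().lower())

coded_gender = [recode(v, GENDER_CODES) for v in raw_gender]
print(coded_gender)
```

Entering the code numbers directly, as recommended above, avoids the need for this clean-up step.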
Tip 7. Always create a code sheet (i.e., key, codebook).
This code sheet (or key) shows how the column name links to the data element you collected, and
describes how code numbers represent various categories. This can be hand-written on the paper data
collection tool or included as a separate worksheet in the same Excel workbook.
For this spreadsheet …
ID run walk bike swim other other_text has_living_will gender race
1 1 1 1 1 1 yoga 1 1 1
2 1 0 1 0 1 karate 0 1 1
3 0 0 0 0 1 aerobics 1 2 2
The code sheet might look like this …
ID                arbitrarily assigned ID number
run               subject runs                               1=yes, 0=no
walk              subject walks                              1=yes, 0=no
bike              subject bikes                              1=yes, 0=no
swim              subject swims                              1=yes, 0=no
other             subject indicated other form of exercise   1=yes, 0=no
other_text        write-in the other exercise
has_living_will   subject has a living will                  1=yes, 0=no
gender            gender                                     1=Male, 2=Female
race              race                                       1=Black, 2=Caucasian
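A code sheet can also be used to check the entered data for codes that are not defined. The following is a minimal sketch in Python using the codes listed above. The names `CODEBOOK` and `find_invalid` are hypothetical, and the second row contains a deliberately wrong gender code for illustration.

```python
# Allowed code numbers per column, taken from the code sheet.
CODEBOOK = {
    "run": {0, 1}, "walk": {0, 1}, "bike": {0, 1}, "swim": {0, 1},
    "other": {0, 1}, "has_living_will": {0, 1},
    "gender": {1, 2}, "race": {1, 2},
}

def find_invalid(rows, codebook):
    """Return (ID, column, value) for every entry that is not an allowed code."""
    invalid = []
    for row in rows:
        for column, allowed in codebook.items():
            value = row.get(column)
            if value is not None and value not in allowed:   # empty cells are skipped
                invalid.append((row["ID"], column, value))
    return invalid

rows = [
    {"ID": 1, "run": 1, "walk": 1, "bike": 1, "swim": 1, "other": 1,
     "has_living_will": 1, "gender": 1, "race": 1},
    {"ID": 2, "run": 1, "walk": 0, "bike": 1, "swim": 0, "other": 1,
     "has_living_will": 0, "gender": 3, "race": 1},   # gender 3 is a deliberate error
]
invalid_entries = find_invalid(rows, CODEBOOK)
print(invalid_entries)
```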
Tip 8. Carefully consider how to handle missing data and be consistent
throughout.
Missing data can be handled in a variety of ways. If data are missing, it is simplest – though not always
best - to leave the cell completely empty. Don’t type anything in the cell - not even a space. In some
cases it is necessary to distinguish a missing value by assigning it a code number. Some conventions for
missing are to use 9, 9999, or 88. But if the data type is numeric (e.g., age), consider using -99 as the
missing code or simply leaving the cell blank to avoid the potential that the missing code will be included
in the analysis as a real number.
Thumbs up:
• Leave cell empty
• Assign special code
Thumbs down:
• ‘NA’
• ‘N/A’
• ‘?’
• ‘Missing’
• ‘Don’t know’
• ‘---’
• ‘(space)’
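The risk described above, that a missing code is analyzed as a real number, can be illustrated with a few made-up ages. The following is a minimal sketch in Python. The names `MISSING_CODE` and `observed` are hypothetical, and an empty cell is represented as `None`.

```python
from statistics import mean

ages = [34, 52, 99, 41, 99]          # made-up data; 99 was typed for "missing"

# Treated as real numbers, the missing codes inflate the average.
mean_with_codes = mean(ages)
print(mean_with_codes)

MISSING_CODE = -99                   # a value that cannot be a real age

def observed(values, missing_code=MISSING_CODE):
    """Drop empty cells (None) and the missing code before analysis."""
    return [v for v in values if v is not None and v != missing_code]

ages_recoded = [34, 52, -99, 41, None]
mean_observed = mean(observed(ages_recoded))
print(mean_observed)
```

Whichever convention is chosen, it should be recorded on the code sheet so that the statistician can exclude the code before analysis.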
Tip 9. The spreadsheet that the statistician receives should simply have the raw
data in rows and columns with simple column labels.
Of course, Excel is much more powerful than just a place to put raw data. It has formulas, summary
statistics, graphs, etc. But if the intent of the spreadsheet is to have it imported into a software
package, then do not include these other features in the spreadsheet. If the researcher desires to
calculate summary statistics or make graphs, this can be done on another worksheet (another tab)
separate from the data worksheet.
Use of color is often helpful for the data entry person, but keep in mind that color itself does not provide
any meaningful information from a data analysis standpoint. It’s OK to use color, but if there is
important information related to the color (e.g., rows highlighted in blue received the intervention and
those in yellow did not), then create a column(s) to indicate that information (e.g., a column labeled
GROUP where 1=intervention, 0=no intervention).
Tip 10. Consult with your statistician.
When possible, investigators should consider showing their spreadsheet to their statistician before
entering all of their data. This is a valuable exercise in many respects: the analyst can spot
inconsistencies between the spreadsheet and the data collection tool, identify troublesome fields, and
ensure that the outcomes the investigator intended to measure are captured on the spreadsheet in such
a way as to allow for the appropriate statistical analysis. The statistician can test-run the transfer of the
data from Excel to the statistical analysis software. This small investment in time up-front can save
hours in the end.
Software referenced in this document:
Microsoft Excel, Microsoft Corporation
SAS – Statistical Analysis Software, SAS Institute
SPSS – IBM SPSS
I hope you have found this document helpful. I welcome your suggestions and comments.
Contact me at nancy@budererdrug.com. Thank you.