Tips on Setting Up Excel Spreadsheets - Nancy Buderer


  1. TIPS FOR SETTING-UP AN EXCEL SPREADSHEET FOR RESEARCH PROJECTS

     Nancy Buderer, MS
     Consulting Biostatistician, Program Evaluator, Research Consultant
     nancy@budererdrug.com
     rev. 3-1-2016

     ABSTRACT

     Objective
     In investigator-initiated research studies, the investigator is often responsible for entering the data into a database or spreadsheet before sending it to a statistician for analysis. This is common in teaching hospitals, particularly for medical resident and student research. Typically Microsoft Excel is chosen because it is readily available and easy to use. This document assists investigators in designing a spreadsheet for their research data that can readily be exported from spreadsheet software (e.g., Microsoft Excel) into a statistical analysis software package (e.g., SAS or SPSS).

     Methods
     Best practices in spreadsheet design for research are provided along with examples in Microsoft Excel. Common pitfalls for data entry are described.

     Results
     Attention to detail is critical not only when collecting research data, but also when entering it into a spreadsheet. The following are basic concepts for designing a spreadsheet:

       1. One row of data per subject
       2. One column for each variable
       3. First column is a unique identifier
       4. Column labels follow SAS or SPSS naming conventions
       5. Columns formatted according to their data type (numeric, mm/dd/yyyy, military time)
       6. Data entered as numbers, not text
       7. Coding system with documentation

     When possible, investigators should show their spreadsheet to their statistician before entering all of their data. This is valuable in many respects: the analyst can spot inconsistencies between the spreadsheet and the data collection tool, identify troublesome fields, and ensure that the outcomes the investigator intended to measure are captured on the spreadsheet in a way that allows for the appropriate statistical analysis.

     Conclusion
     A well-designed spreadsheet for entering research data may improve the accuracy of the data entered and save time in the data analysis phase.
  2. TIPS

     SYMBOLS
     Thumbs up: Do it like this
     Thumbs down: Don’t do this

     Tip 1. Each study subject has one ROW of data.

     It is simplest to keep everything for each research subject on one row, even if a subject has data at multiple time points. Below is an example where subjects have their blood pressure taken at two different times. Subject #1’s first blood pressure was 120 over 80. His second blood pressure was 115 over 80. The column labeled SBP_1 is the subject’s systolic blood pressure at time 1 (120); the column labeled DBP_1 is his diastolic blood pressure at time 1 (80); the column labeled SBP_2 is his systolic blood pressure at time 2 (115); and DBP_2 is his diastolic blood pressure at time 2 (80). All of subject #1’s data are on one row.

     Thumbs up:

     ID   SBP_1   DBP_1   SBP_2   DBP_2
     1    120     80      115     80
     2    130     80      120     75

     In the spreadsheet below, each subject has a row for time 1 and another row for time 2, but there is no way to distinguish between time points. This makes analysis much harder.

     Thumbs down:

     ID   SBP   DBP
     1    120   80
     1    115   80
     2    130   80
     2    120   75
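If data have already been entered with one row per time point, a reshaping step along the following lines could produce the one-row-per-subject layout. This is only a sketch in Python: it assumes the long layout carries an explicit time-point column (the hypothetical `TIME` column here), which the second layout above lacks, and the function name `long_to_wide` is invented for illustration.

```python
def long_to_wide(rows, id_col="ID", time_col="TIME"):
    """Collapse long-format rows into one dict per subject.

    Each measurement column gets the time point appended,
    e.g. SBP at time 1 becomes SBP_1.
    """
    wide = {}
    for row in rows:
        subject = wide.setdefault(row[id_col], {id_col: row[id_col]})
        for name, value in row.items():
            if name in (id_col, time_col):
                continue
            key = f"{name}_{row[time_col]}"
            if key in subject:
                # Two rows claim the same subject and time point.
                raise ValueError(f"duplicate {key} for subject {row[id_col]}")
            subject[key] = value
    return list(wide.values())


# The blood pressure example, with an assumed TIME column added.
long_rows = [
    {"ID": 1, "TIME": 1, "SBP": 120, "DBP": 80},
    {"ID": 1, "TIME": 2, "SBP": 115, "DBP": 80},
    {"ID": 2, "TIME": 1, "SBP": 130, "DBP": 80},
    {"ID": 2, "TIME": 2, "SBP": 120, "DBP": 75},
]
wide_rows = long_to_wide(long_rows)
```

Without a time-point column the function has nothing to build the `_1` and `_2` suffixes from, which is why entering the data in the wide layout from the start is the simpler route.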
  3. Tip 2. Variables that are measured on the subjects are represented in COLUMNS.

     Use one column for each variable. For variables that only have one value (like age, gender, race) this is straightforward. If a particular variable has a “check all that apply” kind of response, then use one column for each possible response choice. Enter the number 1 if the subject chose that response and enter a 0 if they did not check that response.

     For example, a survey asks: “What kinds of exercise have you done in the last week (check all that apply)?”

     Run   Walk   Bike   Swim

     Subject #1 did all of them. Enter 1 for RUN, 1 for WALK, 1 for BIKE, and 1 for SWIM. Subject #3 didn’t do any of these exercises, so enter 0 in each of the columns.

     Thumbs up:

     ID   RUN   WALK   BIKE   SWIM
     1    1     1      1      1
     2    1     0      1      0
     3    0     0      0      0

     The examples below try to put all the response choices in one column. To the computer, this just looks like a string of characters.

     Thumbs down:

     ID   exercise
     1    1,2,3,4
     2    1,3
     3    none

     ID   exercise
     1    run, walk, bike, swim
     2    run, bike
     3    none
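For responses that were already typed into a single column, a conversion along these lines could split them into one 0/1 column per choice. It is a minimal Python sketch; the helper name `to_indicators` is hypothetical, and anything that is not one of the listed choices (including "none") is simply ignored, which may or may not be what a given study needs.

```python
CHOICES = ["run", "walk", "bike", "swim"]


def to_indicators(response, choices=CHOICES):
    """Return {CHOICE: 1 or 0} for one 'check all that apply' answer."""
    # Normalize each comma-separated entry so "Run" and " run" match "run".
    picked = {part.strip().lower() for part in response.split(",")}
    return {choice.upper(): int(choice in picked) for choice in choices}


row2 = to_indicators("run, bike")   # subject #2 in the example
row3 = to_indicators("none")        # subject #3 checked nothing
```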
  4. Tip 3. The first column should be a unique identifier for the subject.

     Name, medical record number, social security number, etc., are convenient and unique identifiers, but they do not protect the subject’s privacy. Develop an arbitrary ID system. If it is necessary to keep a subject’s name or other private identifier, then keep a separate list that links name to the arbitrary ID number.

     Ideally, the ID should be a number, not a series of characters. In the example below, subjects 1 and 2 were enrolled in 2011 and subjects 3 and 4 were enrolled in 2012. Here are 3 ways to incorporate year into the ID.

     Way 1          Way 2    Way 3
     ID   Year      ID       ID
     1    2011      1001     1.2011
     2    2011      1002     2.2011
     3    2012      2001     1.2012
     4    2012      2002     2.2012

     Do not use private identifiers in spreadsheets that are designed for research purposes.

     • name
     • medical record number
     • date of birth
     • social security number

     Protect the subject’s privacy throughout the spreadsheet. It is rarely necessary for the statistician to receive identifying information.

     • Enter age in years rather than the actual date of birth.
     • If the actual date of an event is not necessary, then consider entering only the minimum necessary information (e.g., month and year).
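The idea of an arbitrary ID plus a separate linking list can be sketched as follows. This Python example is illustrative only: the field names (`name`, `mrn`, `age`), the sample records, and the function `assign_ids` are all made up, and a real study would store the linking list in a separately secured file.

```python
def assign_ids(records, private_fields=("name", "mrn")):
    """Split records into de-identified research rows and a private linking list."""
    research_rows, link_rows = [], []
    for study_id, record in enumerate(records, start=1):
        research = {"ID": study_id}   # goes to the statistician
        link = {"ID": study_id}       # stays with the investigator
        for field, value in record.items():
            target = link if field in private_fields else research
            target[field] = value
        research_rows.append(research)
        link_rows.append(link)
    return research_rows, link_rows


# Hypothetical source records containing private identifiers.
records = [
    {"name": "Subject A", "mrn": "000001", "age": 54},
    {"name": "Subject B", "mrn": "000002", "age": 61},
]
research_rows, link_rows = assign_ids(records)
```

Only `research_rows` would be shared; `link_rows` is the separate list that ties each arbitrary ID back to a person.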
  5. Tip 4. Use column labels that follow standard SAS or SPSS variable naming rules.

     Typically a statistician will import an Excel spreadsheet into a statistical analysis software program like SAS or SPSS. By using column labels that already follow standard SAS or SPSS naming conventions, the researcher can save the statistician time (and money). [The terms “column label” and “variable name” are used interchangeably.]

     Column Label Guidelines

     • Unique name for each column; no two columns with the same name
     • Up to 32 characters in length (if using older versions of SPSS or SAS, keep this to 8 characters)
     • Starts with a letter
     • May be a combination of letters and numbers
     • No spaces
     • Underscore is OK, but other special characters are not
     • Upper or lower case; they’re treated the same

     Thumbs up:

     Subject_ID   Age_T1   med_costs   date_first_use

     If you want more description in the column labels so that the data entry person has a more complete description of each column, add it as the first row and keep the column labels as the second row. When the statistician does the analysis, this first row can be easily deleted, leaving the clean set of column labels. (But don’t merge cells in this first upper row. Keep column widths the same throughout the column.)

          age at time 1   Medication costs   1st day used medication
     ID   Age_T1          med_costs          date_first_use

     This first row can be easily deleted before analysis, leaving the simple column labels.

     If the same variable is measured at multiple time points, like age at time 1 and again at time 2, then keep the first part of the column label the same and add an underscore and a number after it to indicate the time point (e.g., AGE_T1, AGE_T2).
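The guidelines above lend themselves to an automated check before the file is sent. The Python sketch below encodes only the rules listed in this tip (it is not the full SAS or SPSS naming specification, which also covers matters such as reserved words); `check_labels` and its messages are hypothetical names.

```python
import re

# Starts with a letter, then letters, digits, or underscores only.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_labels(labels, max_len=32):
    """Return a list of (label, problem) pairs; an empty list means all pass."""
    problems = []
    seen = set()
    for label in labels:
        if len(label) > max_len:
            problems.append((label, "too long"))
        if not NAME_PATTERN.fullmatch(label):
            problems.append((label, "invalid characters or start"))
        key = label.lower()  # upper and lower case are treated the same
        if key in seen:
            problems.append((label, "duplicate"))
        seen.add(key)
    return problems


good = check_labels(["Subject_ID", "Age_T1", "med_costs", "date_first_use"])
bad = check_labels(["1st_dose", "med costs", "AGE_T1", "age_t1"])
```

For older software versions, passing `max_len=8` applies the stricter length limit mentioned in the guidelines.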
  6. Tip 5. Format columns according to their respective data type.

     If a cell in Excel is not specifically formatted, it defaults to “GENERAL”. This is the most dangerous format for researchers because just about anything can be typed into the cell. SAS or SPSS may read those values as strings of characters, which can lead to meaningless data.

     Take the time to properly format columns according to their respective data type. Use the FORMAT CELLS menu to specify the data type of each column of data: numbers are numeric, dates are mm/dd/yyyy, etc.

     In Excel, do this by right-clicking over the column (it is highlighted and a menu pops up); choose “Format Cells…”; choose the “Number” tab; and select from the list of formats. Another way to get to the FORMAT menu is from the FORMAT icon on the toolbar.

     [Screenshot: Format Cells dialog (screen copied from Microsoft Excel, Microsoft Corporation, Microsoft Office 2007)]
  7. Formats preferred for research data are as follows:

     • Number
       Use this for any type of data that is numeric: continuous, interval, ordered, or categorical if a number can be assigned to the category (e.g., 1=yes, 0=no). Excel’s default is 2 decimal places, but it can be changed as needed.

     • Date
       The simplest date format is mm/dd/yyyy. On the menu the example to choose reads “*3/14/2001”. When dates are entered into this column, enter them using the slash marks as shown below. Notice that leading zeros on January through September are automatically dropped.

       date_first_use
       5/1/2012
       5/2/2012
       10/2/2011

     • Text
       If a data type has to be letters, then specify it as such and limit the column width to the maximum number of characters anticipated. Common uses for text fields are fill-in-the-blank responses in surveys. In the example from Tip 2, “Other” might be a choice. A text column (other_text) allows for a fill-in-the-blank response.

       Run   Walk   Bike   Swim   Other _____________________________________

       ID   run   walk   bike   swim   other   other_text
       1    1     1      1      1      1       yoga
       2    1     0      1      0      1       karate
       3    0     0      0      0      1       aerobics

     • Time
       Choose military time; it avoids having to distinguish between AM and PM. The example on the menu is “13:30”.

     GENERAL format is not desirable for spreadsheets that are intended to be imported into a statistical package.
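Once the data are exported as text (for example to CSV), the date and time conventions above can be spot-checked with a few lines of Python. This is a sketch under that export assumption; `is_valid` is a hypothetical helper built on the standard library's `datetime.strptime`.

```python
from datetime import datetime


def is_valid(value, fmt):
    """True if value parses under the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


DATE_FMT = "%m/%d/%Y"   # mm/dd/yyyy; entries without leading zeros also parse
TIME_FMT = "%H:%M"      # 24-hour ("military") time

date_checks = [
    is_valid(d, DATE_FMT)
    for d in ["5/1/2012", "10/2/2011", "2012-05-01", "13/1/2012"]
]
time_checks = [is_valid(t, TIME_FMT) for t in ["13:30", "1:30 PM", "25:00"]]
```

The first two dates and the first time follow the recommended formats; the remaining entries are deliberately malformed to show what the check rejects.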
  8. [Screenshot: Setting NUMBER formats]

     [Screenshot: Setting DATE format]

     (screens copied from Microsoft Excel, Microsoft Corporation, Microsoft Office 2007)
  9. Tip 6. Data should be entered as NUMBERS wherever possible, not text, not special characters.

     Obviously, data that are numeric in nature (e.g., age, pain scores) will be entered as numbers. The problem comes when data are categorical (e.g., yes/no, gender, race).

     Letters are problematic in analyzing research data. Letters are case-sensitive (e.g., ‘N’ is different from ‘n’). Differently spelled words are interpreted as different categories even though they mean the same thing (e.g., ‘Yes’ is different from ‘YES’ and ‘Y’). Numbers are typically faster to enter and less prone to errors.

     For categorical data, it is recommended that code numbers be entered rather than words or letters. In the example below, use 1 for yes and 0 for no*; use 1 for male and 2 for female; use 1 for Black and 2 for Caucasian.

     Thumbs up:

     ID   HAS_LIVING_WILL   GENDER   RACE
     1    1                 1        1
     2    0                 1        1
     3    1                 2        2
     4    0                 2        1
     5    0                 1        2

     * Using 1 for YES (or present) and 0 for NO (absent) is best for some data analyses, such as logistic regression.

     Thumbs down:

     ID   HAS_LIVING_WILL   GENDER   RACE
     1    Yes               M        Black
     2    No                M        Black
     3    yes               F        C
     4    None              f        B
     5    N                 m        Caucasian
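When text entries like those in the second table already exist, they have to be recoded before analysis. The Python sketch below shows one way to do that; the mapping tables reflect an assumption about what each spelling was meant to say, and entries that cannot be mapped (such as "None") come back as `None` so a person can review them rather than having the code guess.

```python
# Assumed spellings mapped to the code numbers used in this tip.
YES_NO = {"yes": 1, "y": 1, "no": 0, "n": 0}
GENDER = {"m": 1, "male": 1, "f": 2, "female": 2}


def recode(value, mapping):
    """Map a text entry to its code number; None flags an entry needing review."""
    return mapping.get(value.strip().lower())


living_will = [recode(v, YES_NO) for v in ["Yes", "No", "yes", "None", "N"]]
gender = [recode(v, GENDER) for v in ["M", "M", "F", "f", "m"]]
```

Every extra spelling needs its own mapping entry and a judgment call, which is the cleanup work that entering code numbers from the start avoids.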
  10. Tip 7. Always create a code sheet (i.e., key, codebook).

     This code sheet (or key) shows how each column name links to the data element you collected, and describes how code numbers represent various categories. It can be hand-written on the paper data collection tool or included as a separate worksheet within the same Excel workbook.

     For this spreadsheet …

     ID   run   walk   bike   swim   other   other_text   has_living_will   gender   race
     1    1     1      1      1      1       yoga         1                 1        1
     2    1     0      1      0      1       karate       0                 1        1
     3    0     0      0      0      1       aerobics     1                 2        2

     The code sheet might look like this …

     ID                arbitrarily assigned ID number
     run               subject runs                                1=yes, 0=no
     walk              subject walks                               1=yes, 0=no
     bike              subject bikes                               1=yes, 0=no
     swim              subject swims                               1=yes, 0=no
     other             subject indicated other form of exercise    1=yes, 0=no
     other_text        write-in the other exercise
     has_living_will   subject has a living will                   1=yes, 0=no
     gender            gender                                      1=Male, 2=Female
     race              race                                        1=Black, 2=Caucasian
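A code sheet can also be used to check the data against it. The Python sketch below stores part of the code sheet above as a dictionary and lists any entered value that the code sheet does not document. The names `CODEBOOK` and `undocumented_codes` are hypothetical, and the third data row has been altered on purpose (gender 3 instead of 2) to give the check something to find.

```python
# A subset of the code sheet: column name -> {code number: meaning}.
CODEBOOK = {
    "run": {1: "yes", 0: "no"},
    "has_living_will": {1: "yes", 0: "no"},
    "gender": {1: "Male", 2: "Female"},
    "race": {1: "Black", 2: "Caucasian"},
}


def undocumented_codes(rows, codebook=CODEBOOK):
    """Return (ID, column, value) for each coded value missing from the code sheet."""
    issues = []
    for row in rows:
        for column, codes in codebook.items():
            value = row.get(column)
            # Empty cells (None) are treated as missing data, not as errors.
            if value is not None and value not in codes:
                issues.append((row["ID"], column, value))
    return issues


rows = [
    {"ID": 1, "run": 1, "has_living_will": 1, "gender": 1, "race": 1},
    {"ID": 2, "run": 1, "has_living_will": 0, "gender": 1, "race": 1},
    {"ID": 3, "run": 0, "has_living_will": 1, "gender": 3, "race": 2},  # deliberate error
]
issues = undocumented_codes(rows)
```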
  11. Tip 8. Carefully consider how to handle missing data and be consistent throughout.

     Missing data can be handled in a variety of ways. If data are missing, it is simplest, though not always best, to leave the cell completely empty. Don’t type anything in the cell, not even a space.

     In some cases it is necessary to distinguish a missing value by assigning it a code number. Some conventions for missing are to use 9, 9999, or 88. But if the data type is numeric (e.g., age), consider using -99 as the missing code, or simply leaving the cell blank, to avoid the risk that the missing code will be included in the analysis as a real number.

     Thumbs up:

     • Leave cell empty
     • Assign special code

     Thumbs down:

     • ‘NA’
     • ‘N/A’
     • ‘?’
     • ‘Missing’
     • ‘Don’t know’
     • ‘---’
     • ‘(space)’
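The risk of a missing code being analyzed as a real number is easy to demonstrate. In the Python sketch below the ages are invented for illustration, -99 is the missing code suggested above, and `None` stands for an empty cell; `mean_ignoring_missing` is a hypothetical helper.

```python
MISSING_CODE = -99


def mean_ignoring_missing(values, missing_code=MISSING_CODE):
    """Average the real values, skipping blanks (None) and the missing code."""
    real = [v for v in values if v is not None and v != missing_code]
    return sum(real) / len(real) if real else None


# Hypothetical ages: one coded as missing, one left blank.
ages = [40, 50, MISSING_CODE, None, 60]

# Treating -99 as a real age drags the average far below any actual value.
naive_mean = sum(v for v in ages if v is not None) / 4
clean_mean = mean_ignoring_missing(ages)
```

Here the naive average is 12.75 while the average of the three real ages is 50.0, so whichever missing code is chosen must be documented on the code sheet so the statistician can exclude it.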
  12. Tip 9. The spreadsheet that the statistician receives should simply have the raw data in rows and columns with simple column labels.

     Of course, Excel is much more powerful than just a place to put raw data. It has formulas, summary statistics, graphs, etc. But if the intent is to import the spreadsheet into a statistical software package, then do not include these other features in it. If the researcher wants to calculate summary statistics or make graphs, this can be done on another worksheet (another tab) separate from the data worksheet.

     Use of color is often helpful for the data entry person, but keep in mind that color itself does not provide any meaningful information from a data analysis standpoint. It’s OK to use color, but if there is important information related to the color (e.g., rows highlighted in blue received the intervention and those in yellow did not), then create a column (or columns) to record that information (e.g., a column labeled GROUP where 1=intervention, 0=no intervention).

     Tip 10. Consult with your statistician.

     When possible, investigators should consider showing their spreadsheet to their statistician before entering all of their data. This is a valuable exercise in many respects: the analyst can spot inconsistencies between the spreadsheet and the data collection tool, identify troublesome fields, and ensure that the outcomes the investigator intended to measure are captured on the spreadsheet in a way that allows for the appropriate statistical analysis. The statistician can test-run the transfer of the data from Excel to the statistical analysis software. This small investment of time up front can save hours in the end.

     Software referenced in this document is as follows:

     • Microsoft Excel, Microsoft Corporation
     • SAS: Statistical Analysis Software, SAS Institute
     • SPSS: IBM SPSS

     I hope you have found this document helpful. I welcome your suggestions and comments. Contact me at nancy@budererdrug.com. Thank you.
