Preparing Data in Excel

The members of CCORDA have expertise in design and planning of studies, including preparation of data collection forms and database creation. We encourage researchers to include CCORDA in all phases of the study from design to analysis and dissemination of results. Some researchers collect and enter their own data for analysis. Accurate data entry is critical for the success of the study. We have prepared some helpful hints for entering data into an Excel Workbook for ease in statistical analysis.  

Microsoft Excel can be a useful platform to enter and maintain research study data. Excel is fairly easy to learn and use. Researchers can use Excel's simple statistical and plotting functions to help gain insight into their data. However, most research projects require more extensive statistical techniques that can be most easily performed using additional statistical software packages such as SAS or SPSS software.

In order to have your data easily imported into a statistical software package we have developed these guidelines for data entry into Excel.

Here is a good example of data entry into an Excel file followed by guidelines for data entry.

Good Excel Data Sheet:

Guidelines:

  1. Put variables in columns and observation in rows.
    • Include a unique identifying number for each case.
    • Be sure that each variable name is unique (no duplicate variable names).
  2. Put variable names in the first row.
    • Variable names must start with a letter.
    • Do not include special characters (#, !, ?, %, etc.) or spaces in your variable names.
    • Choose readily recognizable names for variables - but not too long (<= 16 characters best).
  3. Use a separate column for each piece of information.
    • Don’t enter data such as "120/80" for blood pressure. Enter systolic blood pressure as one variable and diastolic blood pressure as another variable. Don't enter data as "A,C,D" or "BDF" if there are three possible answers to a question. Include a separate column for each answer.
  4. When entering dates (especially for years prior to 1930) include a 4 digit year.
    • Two digit years can cause problems for statistical software when reading data from Excel files. The best format for dates is mm/dd/yyyy, where mm is a 2 digit month, dd is a 2 digit day and yyyy is a 4 digit year.
  5. Decide on "missingness" conventions.
    • Missing data can cause a multitude of problems. To enter a missing data value either enter a blank or an "impossible" numeric code (for numbers) or an easily recognizable single digit character code for character (trying to avoid mixing numeric and character data). Be sure, if you use a missing value code, that it cannot be confused with a "real" data value.
  6. Be consistent in your data entry
    • When entering data keep the same format throughout.

Good Example

ID DOB Sex
1 12/31/1976 F
2 01/01/1977 M
3 01/02/1977 F
4 01/03/1977 F
5 01/04/1977 M
6 01/05/1977 F
7 01/06/1977 M
8 01/07/1977 M
9 01/08/1977 F
10 01/09/1977 F

  Bad Example

ID DOB Sex
1 12/31/1976 f
2 1-Jan-77 m
3 01/02/1977 Female
4 01/03/77 F
5 01/04/1977 Male
6 01/05/1977 F
7 1/6/77 12:00 AM M
8 01/07/1977 m
9 08-Jan-77 F
10 01/09/1977 f


Notice in the Good Example above that the date variable has the same format (mm/dd/yyyy) and the sex variable is consistent throughout in both case and type (character variable). In the Bad Example the date variable is in different formats without a 4 digit year for all the observations. The sex variable is still a character variable, but statistical software will read this variable as having six different levels instead of two.

  1. Use only one worksheet for your data.
    • If you decide to use multiple sheets for you data, follow the variable naming conventions for the tabs that name the sheets (keep the names simple and unique).
  2. Do not "stack" data on the same sheets.
    • For example, treated versus non-treated patients can be handled by column variable that has a code for Treated (yes/no).
  3. Avoid using "special" Excel features (i.e. hidden columns, graphs on the data sheet that is your primary database, colors, italics, bold).
    • These features can be used on other separate "subset" or "analysis" spreadsheets that are for the investigator, but not the statistician or programmer.
  4. Document your database with a data dictionary and/or codebook.
    • Documenting your database will help the statistician, programmer, data manager, and yourself, the researcher, understand your data and database. It is a good idea to document what your variables are and what they mean. The data dictionary should include all of the variable names, data type that corresponds to the variable, a label or longer name that describes the variable including the units it is measured in, the codes for any categorical variables, and any notes for the variable. This can be a separate worksheet or document file. Here is an example of a data dictionary.

      Data Dictionary:

      Variable Name

      Data Type

      Description

      Codes

      Notes

      ID Numeric Patient ID
      Treated Numeric Treatment group 1=treated, 2=control
      Age Numeric Age (years)
      Sex Character Gender M=Male, F=Female
      Height Numeric Height (cm) Blanks=missing data
      Weight Numeric Weight (kg) Blanks=missing data
      Proc_date Date Procedure date
      systolic_BP Numeric Systolic blood pressure Blanks=missing data
      diastolic_BP Numeric Diastolic blood pressure Blanks=missing data
  5. When in doubt, ask the statistician, data analyst, or programmer.
    • Be sure the effort you are putting forth is necessary. The CCORDA member should be able to tell you precisely what form the data needs to be in to suit its conversion and analysis.