The members of CCORDA have expertise in design and planning of studies, including preparation
of data collection forms and database creation. We encourage researchers to include CCORDA
in all phases of the study from design to analysis and dissemination of results. Some
researchers collect and enter their own data for analysis. Accurate data entry is critical
for the success of the study. We have prepared some helpful hints for entering data into an
Excel Workbook for ease in statistical analysis.
Microsoft Excel can be a useful platform to enter and maintain research study data. Excel
is fairly easy to learn and use. Researchers can use Excel's simple statistical and plotting
functions to help gain insight into their data. However, most research projects require more
extensive statistical techniques that can be most easily performed using additional
statistical software packages such as SAS or SPSS software.
In order to have your data easily imported into a statistical software package we have
developed these guidelines for data entry into Excel.
Here is a good example of data entry into an Excel file followed by guidelines for
data entry.
Good Excel Data Sheet:
Guidelines:
Put variables in columns and observation in rows.
Include a unique identifying number for each case.
Be sure that each variable name is unique (no duplicate variable names).
Put variable names in the first row.
Variable names must start with a letter.
Do not include special characters (#, !, ?, %, etc.) or spaces in your variable names.
Choose readily recognizable names for variables - but not too long (<= 16 characters best).
Use a separate column for each piece of information.
Don’t enter data such as "120/80" for blood pressure. Enter systolic blood pressure as one variable and diastolic blood pressure as another variable. Don't enter data as "A,C,D" or "BDF" if there are three possible answers to a question. Include a separate column for each answer.
When entering dates (especially for years prior to 1930) include a 4 digit year.
Two digit years can cause problems for statistical software when reading data from Excel files. The best format for dates is mm/dd/yyyy, where mm is a 2 digit month, dd is a 2 digit day and yyyy is a 4 digit year.
Decide on "missingness" conventions.
Missing data can cause a multitude of problems. To enter a missing data value either enter a blank or an "impossible" numeric code (for numbers) or an easily recognizable single digit character code for character (trying to avoid mixing numeric and character data). Be sure, if you use a missing value code, that it cannot be confused with a "real" data value.
Be consistent in your data entry
When entering data keep the same format throughout.
Good Example:
ID
DOB
Sex
1
12/31/1976
F
2
01/01/1977
M
3
01/02/1977
F
4
01/03/1977
F
5
01/04/1977
M
6
01/05/1977
F
7
01/06/1977
M
8
01/07/1977
M
9
01/08/1977
F
10
01/09/1977
F
Bad Example:
ID
DOB
Sex
1
12/31/1976
f
2
1-Jan-77
m
3
01/02/1977
Female
4
01/03/77
F
5
01/04/1977
Male
6
01/05/1977
F
7
1/6/77 12:00 AM
M
8
01/07/1977
m
9
08-Jan-77
F
10
01/09/1977
f
Notice in the Good Example above that the date variable has the same format (mm/dd/yyyy) and
the sex variable is consistent throughout in both case and type (character variable). In the
Bad Example the date variable is in different formats without a 4 digit year for all the
observations. The sex variable is still a character variable, but statistical software will
read this variable as having six different levels instead of two.
Use only one worksheet for your data.
If you decide to use multiple sheets for you data, follow the variable naming
conventions for the tabs that name the sheets (keep the names simple and unique).
Do not "stack" data on the same sheets.
For example, treated versus non-treated patients can be handled by column
variable that has a code for Treated (yes/no).
Avoid using "special" Excel features (i.e. hidden columns, graphs on the data sheet
that is your primary database, colors, italics, bold).
These features can be used on other separate "subset" or "analysis" spreadsheets
that are for the investigator, but not the statistician or programmer.
Document your database with a data dictionary and/or codebook.
Documenting your database will help the statistician, programmer, data manager,
and yourself, the researcher, understand your data and database. It is a good idea
to document what your variables are and what they mean. The data dictionary should
include all of the variable names, data type that corresponds to the variable, a
label or longer name that describes the variable including the units it is measured
in, the codes for any categorical variables, and any notes for the variable. This
can be a separate worksheet or document file. Here is an example of a data
dictionary.
Data Dictionary:
Variable Name
Data Type
Description
Codes
Notes
ID
Numeric
Patient ID
Treated
Numeric
Treatment group
1=treated, 2=control
Age
Numeric
Age (years)
Sex
Character
Gender
M=Male, F=Female
Height
Numeric
Height (cm)
Blanks=missing data
Weight
Numeric
Weight (kg)
Blanks=missing data
Proc_date
Date
Procedure date
systolic_BP
Numeric
Systolic blood pressure
Blanks=missing data
diastolic_BP
Numeric
Diastolic blood pressure
Blanks=missing data
When in doubt, ask the statistician, data analyst, or programmer.
Be sure the effort you are putting forth is necessary. The CCORDA member should be able to tell you
precisely what form the data needs to be in to suit its conversion and analysis.