Excel Tips for Easily Importing Data
to Statistical Software such as SAS
Many researchers use Microsoft's spreadsheet program Excel as a convenient platform to enter and maintain their study data.
Excel is fairly easy to learn and use. It is also quite prevalent in the university enviroments (education and research). Many
researchers often use Excel's simple statistical and plotting functions to help gain insight into their data. More often than not
however, most research questions can only be adequately addressed after utilizing more sophisticated and involved analyses,
which then require the use of dedicated statistical software.
The following are Excel tips for easily importing data to statistical software such as SAS.
- Put variables in columns and observation in rows.
- Put variable names in the first row. Also
- Do not start a variable name with a number
- Do not include special characters (#, !, ?, etc.) in your variable names
- Please note that our DBMS/copy conversion software program will replace spaces with
underscores ('_')
- Choose readily recognizable names for variables - but not too long (<= 16 characters best)
- Use a separate column for each piece of information. For example, do not put an
asterisk by a name because that person was lost to follow-up. If it is important
information, it deserves its own column.
Other examples:
- Name: usually should be broken down into first name, last name and middle name when needed.
Fname | Lname |
John | Smith |
Mary | Brown |
Henry | Johnson |
- Address: usually should be broken down into street, city, state, country, zip_code.
street | city | state | country | zip_code |
34B, West Franklin Street | Chapel Hill | NC | US | 27514 |
Buck Jones Ave. 10 | Raleigh | NC | US | 24955 |
- Decide on "missingness" conventions. A period mark works well (It is used by SAS for missing numeric values).
- Avoid using "n/a" and other symbols of this nature.
- Format a date as a date variable, not a character or numeric variable.
- Establish a convention and then stick to it. For example; avoid the following situtation:
protocol_ID | Status |
LCC320 | closed | good pair |
Bad to mix |
LCC394 | open |
LCC605 | YES | good pair |
LCC298 | NO |
LCC707 | yes | good pair |
LCC577 | no |
LCC977 | Y | good pair |
LCC986 | N |
For the variable "closed", which asks for the status of a protocol, all the above answers
are possible right answers to the question when they are not mixed up in one table.
Basically, what we need to do is just choose one pair and stick to it throughout one database
(don't forget case sensitivity).
- Don't make things harder than they need to be. Using the variable for gender
as an example, one does not necessarily need to make it numeric (i.e. with female=1
or male=2). This makes it harder than it needs to be to update and "QC" (or Quality
Control check) your data.
- Do not "stack" data on the same sheets. For example, tumor versus non-tumor patients
can be handled by column variable that has a code for Tumor (yes/no).
- Avoid using "special" Excel features (i.e. hidden columns, graphs on the data sheet
that is your primary database). These features can be used on other separate
"subset" or "analysis" spreadsheets that are for the investigator, not the
statistician or data manager.
- Do not use multiple sheets for each spreadsheet file (i.e. multiple sheets in one file).They will have to be separated
and re-saved as separate file so the conversion software would be able to do its work. Again you can do this
for yourself, but the statistician will need all the data on one sheet.
- Document your database. Documenting your database is a good way to help yourself and others understand your data and database. It is a good idea to document
what your variables are and what they mean. This can be in a formal fashion such as this:
Variable Name | Data Type | Description |
p_fname | Character | Patient's first name |
p_lname | Character | Patient's last name |
vis_num | Numeric | The number of visits that the patient has paid |
death_date | Date | Date of death |
Or it might possibly be in a separate documents. Also, if legends are needed, include them off to the side and not in the data area. (That is if you are doing this on the spread sheet that is to be given to the statistician or data manager so that it can be stripped off before it is converted by the conversion software to a SAS data set).
- When in doubt, ask the statistician.
Be sure that the effort you are putting forth is necessary. The statistician should be able to tell you precisely what form the data needs to be in to suit its conversion and analysis.
Preparing a data file or a database for survival analysis
Variables:
For performing survival analysis, a good way to structure the
database is to have the following variables (with suggested variable
names in parenthesis):
- Date of Birth (DOB, DB)
- Date of Diagnosis (DX, DDX)
- Date of Disease Progression (DDP, PROG_DATE)
- Date of Death (DD, DEATH_DATE)
- Date of Last Follow-up (DLF, FU_DATE, STATUS_DATE)
- Disease Progression (PROG): Yes/No, 1/0
- Death (DIED): Yes/No, 1/0
- Censored Yes/No, 1/0
Other variables could be entered as needed. For example, disease
stage, grade, treatment, gender, etc. If any date is unknown, leave
it blank. Interpretation of blank fields often depends on other fields.
For example, if the date of disease progression is blank, then we
know that either: a) the patient has not progressed, or b) the patient has
progressed but the date of progression is unknown. To resolve
this we look at variable PROG; if PROG="No" we'll know it is the
former (not progressed); if PROG="Yes" we'll know it is the later
(progressed, date unknown).
Other guidelines:
- "Dates" should entered in full; month, day and year.
Year should be four digits (1997, not 97; 2001 not 01).
Dates can be entered as one field (e.g. "9/25/1997")
or as three separate fields (e.g. "9", "25", 1997").
The later option is better when the day of the month is
missing or unknown.
- Do not put patient names or medical record numbers in the file.
Assign study ID numbers, and keep a separate list that matches
study ID's to medical record numbers and names.
University of North Carolina at Chapel Hill, Lineberger Comprehensive Cancer Center, Biostatistics and Data Management
|