Excel Tips for Easily Importing Data to Statistical Software such as SAS

Excel Tips for Easily Importing Data

to Statistical Software such as SAS

Many researchers use Microsoft's spreadsheet program Excel as a convenient platform to enter and maintain their study data. Excel is fairly easy to learn and use. It is also quite prevalent in the university enviroments (education and research). Many researchers often use Excel's simple statistical and plotting functions to help gain insight into their data. More often than not however, most research questions can only be adequately addressed after utilizing more sophisticated and involved analyses, which then require the use of dedicated statistical software.

The following are Excel tips for easily importing data to statistical software such as SAS.

Put variables in columns and observation in rows.

Put variable names in the first row.

Do not start a variable name with a number
Do not include special characters (#, !, ?, etc.) in your variable names
Please note that our DBMS/copy conversion software program will replace spaces with underscores ('_')
Choose readily recognizable names for variables - but not too long (<= 16 characters best)

Use a separate column for each piece of information.

Name: usually should be broken down into first name, last name and middle name when needed.

Fname Lname

John Smith

Mary Brown

Henry Johnson

Address: usually should be broken down into street, city, state, country, zip_code.

street city state country zip_code

34B, West Franklin Street Chapel Hill NC US 27514

Buck Jones Ave. 10 Raleigh NC US 24955

Decide on "missingness" conventions.

Avoid using "n/a" and other symbols of this nature.

Format a date as a date variable, not a character or numeric variable.

Establish a convention and then stick to it.

protocol_ID	Status
LCC320	closed	good pair	Bad to mix
LCC394	open	good pair
LCC605	YES	good pair
LCC298	NO	good pair
LCC707	yes	good pair
LCC577	no	good pair
LCC977	Y	good pair
LCC986	N	good pair

Don't make things harder than they need to be.

Do not "stack" data on the same sheets.

Avoid using "special" Excel features (i.e. hidden columns, graphs on the data sheet that is your primary database).

Do not use multiple sheets for each spreadsheet file (i.e. multiple sheets in one file).

Document your database.

Variable Name	Data Type	Description
p_fname	Character	Patient's first name
p_lname	Character	Patient's last name
vis_num	Numeric	The number of visits that the patient has paid
death_date	Date	Date of death

When in doubt, ask the statistician.

Preparing a data file or a database for survival analysis

Variables:

For performing survival analysis, a good way to structure the database is to have the following variables (with suggested variable names in parenthesis):

Date of Birth (DOB, DB)
Date of Diagnosis (DX, DDX)
Date of Disease Progression (DDP, PROG_DATE)
Date of Death (DD, DEATH_DATE)
Date of Last Follow-up (DLF, FU_DATE, STATUS_DATE)
Disease Progression (PROG): Yes/No, 1/0
Death (DIED): Yes/No, 1/0
Censored Yes/No, 1/0

Other variables could be entered as needed. For example, disease stage, grade, treatment, gender, etc. If any date is unknown, leave it blank. Interpretation of blank fields often depends on other fields. For example, if the date of disease progression is blank, then we know that either: a) the patient has not progressed, or b) the patient has progressed but the date of progression is unknown. To resolve this we look at variable PROG; if PROG="No" we'll know it is the former (not progressed); if PROG="Yes" we'll know it is the later (progressed, date unknown).

Other guidelines:

"Dates" should entered in full; month, day and year. Year should be four digits (1997, not 97; 2001 not 01). Dates can be entered as one field (e.g. "9/25/1997") or as three separate fields (e.g. "9", "25", 1997"). The later option is better when the day of the month is missing or unknown.
Do not put patient names or medical record numbers in the file. Assign study ID numbers, and keep a separate list that matches study ID's to medical record numbers and names.

University of North Carolina at Chapel Hill, Lineberger Comprehensive Cancer Center, Biostatistics and Data Management

street	city	state	country	zip_code
34B, West Franklin Street	Chapel Hill	NC	US	27514
Buck Jones Ave. 10	Raleigh	NC	US	24955

Fname	Lname
John	Smith
Mary	Brown
Henry	Johnson