Excel Tips for Easily Importing Data

to Statistical Software such as SAS


Many researchers use Microsoft's spreadsheet program Excel as a convenient platform to enter and maintain their study data. Excel is fairly easy to learn and use. It is also quite prevalent in the university enviroments (education and research). Many researchers often use Excel's simple statistical and plotting functions to help gain insight into their data. More often than not however, most research questions can only be adequately addressed after utilizing more sophisticated and involved analyses, which then require the use of dedicated statistical software.

The following are Excel tips for easily importing data to statistical software such as SAS.

  1. Put variables in columns and observation in rows.
  2. Put variable names in the first row.
  3. Also

  4. Use a separate column for each piece of information.
  5. For example, do not put an asterisk by a name because that person was lost to follow-up. If it is important information, it deserves its own column.
    Other examples:

  6. Decide on "missingness" conventions.
  7. A period mark works well (It is used by SAS for missing numeric values).

  8. Avoid using "n/a" and other symbols of this nature.
  9. Format a date as a date variable, not a character or numeric variable.
  10. Establish a convention and then stick to it.
  11. For example; avoid the following situtation:
       
    protocol_IDStatus  
    LCC320closedgood pair Bad to mix
    LCC394open
    LCC605YESgood pair
    LCC298NO
    LCC707yesgood pair
    LCC577no
    LCC977Ygood pair
    LCC986N

    For the variable "closed", which asks for the status of a protocol, all the above answers are possible right answers to the question when they are not mixed up in one table. Basically, what we need to do is just choose one pair and stick to it throughout one database (don't forget case sensitivity).

  12. Don't make things harder than they need to be.
  13. Using the variable for gender as an example, one does not necessarily need to make it numeric (i.e. with female=1 or male=2). This makes it harder than it needs to be to update and "QC" (or Quality Control check) your data.

  14. Do not "stack" data on the same sheets.
  15. For example, tumor versus non-tumor patients can be handled by column variable that has a code for Tumor (yes/no).

  16. Avoid using "special" Excel features (i.e. hidden columns, graphs on the data sheet that is your primary database).
  17. These features can be used on other separate "subset" or "analysis" spreadsheets that are for the investigator, not the statistician or data manager.

  18. Do not use multiple sheets for each spreadsheet file (i.e. multiple sheets in one file).
  19. They will have to be separated and re-saved as separate file so the conversion software would be able to do its work. Again you can do this for yourself, but the statistician will need all the data on one sheet.

  20. Document your database.
  21. Documenting your database is a good way to help yourself and others understand your data and database. It is a good idea to document what your variables are and what they mean. This can be in a formal fashion such as this:

    Variable NameData TypeDescription
    p_fnameCharacterPatient's first name
    p_lnameCharacterPatient's last name
    vis_numNumericThe number of visits that the patient has paid
    death_dateDateDate of death

    Or it might possibly be in a separate documents. Also, if legends are needed, include them off to the side and not in the data area. (That is if you are doing this on the spread sheet that is to be given to the statistician or data manager so that it can be stripped off before it is converted by the conversion software to a SAS data set).

  22. When in doubt, ask the statistician.
  23. Be sure that the effort you are putting forth is necessary. The statistician should be able to tell you precisely what form the data needs to be in to suit its conversion and analysis.

Preparing a data file or a database for survival analysis

Variables:

For performing survival analysis, a good way to structure the database is to have the following variables (with suggested variable names in parenthesis):

Other variables could be entered as needed. For example, disease stage, grade, treatment, gender, etc. If any date is unknown, leave it blank. Interpretation of blank fields often depends on other fields. For example, if the date of disease progression is blank, then we know that either: a) the patient has not progressed, or b) the patient has progressed but the date of progression is unknown. To resolve this we look at variable PROG; if PROG="No" we'll know it is the former (not progressed); if PROG="Yes" we'll know it is the later (progressed, date unknown).

Other guidelines:
  1. "Dates" should entered in full; month, day and year. Year should be four digits (1997, not 97; 2001 not 01). Dates can be entered as one field (e.g. "9/25/1997") or as three separate fields (e.g. "9", "25", 1997"). The later option is better when the day of the month is missing or unknown.
  2. Do not put patient names or medical record numbers in the file. Assign study ID numbers, and keep a separate list that matches study ID's to medical record numbers and names.



University of North Carolina at Chapel Hill, Lineberger Comprehensive Cancer Center, Biostatistics and Data Management