Data Preparation for Determining Sample Size
Bruce Ratner, Ph.D.
Data preparation can be defined as getting acquainted with the data to understand what they tell you. You must 1] ensure there are no impossible or improbable values (e.g., an age of 120 years, or a boy named Sue, respectively), and 2] audit missing and zero values. When the data at hand are BIG (e.g., hundreds of variables), auditing the missing values can be onerous. It is not uncommon for variables to have different percentages of non-missing values (also known as “coverage”). For example, variable INCOME has small (poor) coverage, typically 20%. That is, 20% of the sample has INCOME values, and the remaining 80% of the sample has missing values for INCOME. As another example, variable AGE has large (good) coverage, typically 90%. Thus, 90% of the sample has AGE values, and the remaining 10% of the sample has missing values for AGE.
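The coverage figures above are simple proportions of non-missing values. As a minimal sketch in Python (used here only for illustration; the toy sample, the `None` marker for missing values, and the function name `coverage` are assumptions, not part of the original program), coverage could be computed as:

```python
# Hypothetical toy sample; None marks a missing value.
sample = [
    {"AGE": 34, "INCOME": None},
    {"AGE": 51, "INCOME": 42000},
    {"AGE": None, "INCOME": None},
    {"AGE": 28, "INCOME": None},
    {"AGE": 45, "INCOME": None},
]


def coverage(rows, variable):
    """Return the share of rows with a non-missing value for `variable`."""
    non_missing = sum(1 for row in rows if row.get(variable) is not None)
    return non_missing / len(rows)


# Report the coverage of each variable as a percentage.
for name in ("AGE", "INCOME"):
    print(f"{name}: {coverage(sample, name):.0%} coverage")
```

In this toy sample AGE is present in four of five records and INCOME in one of five, which mirrors the good-versus-poor coverage contrast described above.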
When coverage varies across the variables at hand, say, between 15% and 100%, the complete-case analysis (CCA) sample size, i.e., the number of records with no missing values on any variable, is often exceedingly small, rendering the intended analysis useless. The data analyst must decide on a single acceptable minimum coverage level for all variables that ensures a reliable imputation of the missing values. This yields a stable CCA-sample size and a viable dataset, ensuring a workable analysis.
Varying levels of Coverage require finding the optimal mix between Number of Variables and CCA-Sample Size. The relationship among Coverage, Number of Variables, and CCA-Sample Size is described below:
1. As the desired level of Coverage increases:
2. As the desired level of Coverage decreases:
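The interplay among Coverage, Number of Variables, and CCA-Sample Size can be explored on a small example. The following is a minimal sketch in Python, not the author's method: the toy data, the variable names GENDER and TENURE, the thresholds, and the helper names `keep_variables`, `cca_sample_size`, and `tradeoff` are all assumptions made for illustration.

```python
# Hypothetical toy sample; None marks a missing value.
rows = [
    {"AGE": 34, "GENDER": 1, "INCOME": None, "TENURE": 5},
    {"AGE": 51, "GENDER": 0, "INCOME": 42000, "TENURE": None},
    {"AGE": None, "GENDER": 1, "INCOME": None, "TENURE": 2},
    {"AGE": 28, "GENDER": 1, "INCOME": None, "TENURE": 7},
    {"AGE": 45, "GENDER": 0, "INCOME": None, "TENURE": None},
]


def coverage(rows, variable):
    """Share of rows with a non-missing value for `variable`."""
    return sum(1 for row in rows if row[variable] is not None) / len(rows)


def keep_variables(rows, min_coverage):
    """Variables whose coverage meets the minimum coverage level."""
    return [v for v in rows[0] if coverage(rows, v) >= min_coverage]


def cca_sample_size(rows, variables):
    """Number of rows with no missing value on any of `variables`."""
    return sum(1 for row in rows if all(row[v] is not None for v in variables))


def tradeoff(rows, thresholds):
    """For each threshold: (threshold, number of variables, CCA-sample size)."""
    result = []
    for t in thresholds:
        kept = keep_variables(rows, t)
        result.append((t, len(kept), cca_sample_size(rows, kept)))
    return result


for t, n_vars, n_cases in tradeoff(rows, [0.1, 0.5, 0.75, 1.0]):
    print(f"min coverage {t:.0%}: {n_vars} variables, CCA-sample size {n_cases}")
```

On this toy sample, raising the minimum coverage level keeps fewer variables and leaves more complete cases; lowering it does the opposite. Real data need not behave as cleanly, so the table produced by `tradeoff` is best read as a starting point for choosing the level.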
********** SAS-code Program **********
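/* Count the non-missing values (N) of every numeric variable in dataset IN */
PROC MEANS data=IN N noprint;
output out=COVERAGE (drop=_TYPE_ _FREQ_) N=;
run;
/* Build macro variable DN: names of the numeric variables with N lt 1000 */
data _null_;
set COVERAGE end=last;
array all_nums[*] _NUMERIC_;
/* make sure the length is long enough */
length DN $ 300;
do i = 1 to dim(all_nums);
if all_nums[i] < 1000 then DN = trim(DN) || ' ' || vname(all_nums[i]);
end;
call symput('DN', trim(DN));
call symput('AS', put(dim(all_nums), 8.-L));
if last then do;
LN = length(DN);
put 'length ' LN;
end;
run;
/* Dataset KEEP_VARS includes all character variables in dataset IN, irrespective of sample size */
/* Assumes DN is not empty, i.e., at least one numeric variable has N lt 1000 */
data KEEP_VARS (drop=&DN) DROP_VARS (keep=&DN);
set IN;
run;
PROC MEANS data=KEEP_VARS n nmiss;
title2 ' KEEP-VARIABLES N ge 1000';
run;
PROC MEANS data=DROP_VARS n nmiss;
title2 ' DROP-VARIABLES var N lt 1000';
run;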