
Data Preparation for Determining Sample Size
Bruce Ratner, Ph.D.
Data preparation can be defined as becoming acquainted with the data
in order to understand what they tell you. You must (1) ensure there
are no impossible or improbable values (e.g., age of 120 years, or a
boy named Sue, respectively), and (2) audit missing and zero values.
When the data at hand are BIG (e.g., hundreds of variables), the
auditing of missing values can be onerous. It is not uncommon to have
variables with different percentages (also known as "coverage") of
nonmissing values. For example, variable INCOME has small (poor)
coverage, typically 20%. That is, 20% of the sample has INCOME values,
and the remaining 80% of the sample has missing values for INCOME. As
another example, consider variable AGE, which has large (good)
coverage, typically 90%. Thus, 90% of the sample has AGE values, and
the remaining 10% of the sample has missing values for AGE.
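Coverage as defined above can be tabulated directly. The following is a minimal sketch, not part of the original program: it assumes the data sit in a dataset named IN (the name used in the SAS-code program later in this article) and that only numeric variables are of interest. The names NCOUNT, COVERAGE_PCT, VARIABLE, N_NONMISS and PCT_COVERAGE are illustrative, and are assumed not to clash with variable names already in IN.

```sas
/* Sketch: percent coverage (share of nonmissing values) per numeric variable. */
/* Assumes dataset IN exists; all other names here are illustrative.           */

/* One output row: _FREQ_ = number of observations in IN,                      */
/* plus one nonmissing count per numeric variable.                             */
proc means data=IN noprint;
   var _numeric_;
   output out=NCOUNT (drop=_TYPE_) n=;
run;

/* Turn the single row into one row per variable, with its coverage.           */
data COVERAGE_PCT (keep=VARIABLE N_NONMISS PCT_COVERAGE);
   set NCOUNT;
   array counts[*] _numeric_;   /* numeric variables defined so far, incl. _FREQ_ */
   length VARIABLE $ 32;
   do i = 1 to dim(counts);
      VARIABLE = vname(counts[i]);
      if VARIABLE ne '_FREQ_' then do;
         N_NONMISS    = counts[i];
         PCT_COVERAGE = 100 * N_NONMISS / _FREQ_;
         output;
      end;
   end;
run;

/* List variables from poorest to best coverage.                               */
proc sort data=COVERAGE_PCT;
   by PCT_COVERAGE;
run;

proc print data=COVERAGE_PCT noobs;
run;
```

With the examples above, INCOME would appear near the top of the listing with a PCT_COVERAGE of about 20, and AGE near the bottom with about 90.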
When the coverage across the variables at hand has varying levels,
say, between 15% and 100%, the complete-case analysis (CCA) sample
size can become very small, because a case is complete only if it has
nonmissing values on every variable. Such a small sample renders the
intended analysis useless. The data analyst must decide on a single
acceptable minimum coverage level for all variables that ensures a
reliable imputation for the missing values. This yields a stable
CCA-sample size and a viable dataset, ensuring a workable analysis.
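Varying levels of Coverage require finding the optimal mix between
Number of Variables and CCA-Sample Size. The relationship among
Coverage, Number of Variables and CCA-Sample Size is described below:
1. As the desired level of Coverage increases, the Number of Variables
   meeting that level decreases and the CCA-Sample Size increases.
2. As the desired level of Coverage decreases, the Number of Variables
   meeting that level increases and the CCA-Sample Size decreases.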
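The trade-off among Coverage, Number of Variables and CCA-Sample Size can be explored by trying several minimum coverage levels and reading off the other two quantities. The macro below is a rough sketch under stated assumptions, not the author's procedure: the macro name cca_size, its parameter mincov (minimum coverage in percent), and the names NCOUNT, KEEPLIST, NKEEP and _CCA are all hypothetical. It assumes a dataset named IN and considers numeric variables only.

```sas
/* Sketch: for a chosen minimum coverage (percent), report how many numeric   */
/* variables meet it and how many complete cases they leave.                  */
/* cca_size, mincov, NCOUNT, KEEPLIST, NKEEP and _CCA are hypothetical names. */
%macro cca_size(mincov=);
   %local KEEPLIST NKEEP;

   /* Nonmissing count per numeric variable; _FREQ_ holds the row count of IN */
   proc means data=IN noprint;
      var _numeric_;
      output out=NCOUNT (drop=_TYPE_) n=;
   run;

   /* Collect the names of variables whose coverage is at least &mincov       */
   data _null_;
      set NCOUNT;
      array counts[*] _numeric_;
      length KEEPLIST $ 32767;
      NKEEP = 0;
      do i = 1 to dim(counts);
         if vname(counts[i]) ne '_FREQ_'
            and 100 * counts[i] / _FREQ_ >= &mincov then do;
            KEEPLIST = catx(' ', KEEPLIST, vname(counts[i]));
            NKEEP = NKEEP + 1;
         end;
      end;
      call symputx('KEEPLIST', KEEPLIST);
      call symputx('NKEEP', NKEEP);
   run;

   %if &NKEEP > 0 %then %do;
      /* A case is complete when none of the kept variables is missing        */
      data _null_;
         set IN end=last;
         if nmiss(of &KEEPLIST) = 0 then _CCA + 1;
         if last then
            put "Minimum coverage &mincov pct, variables kept = &NKEEP, CCA sample size = " _CCA;
      run;
   %end;
   %else %put No numeric variable reaches &mincov percent coverage.;
%mend cca_size;

/* Example calls: compare the log lines across coverage levels */
%cca_size(mincov=20);
%cca_size(mincov=50);
%cca_size(mincov=90);
```

Comparing the log lines across calls shows the pattern in the list above: a higher minimum coverage keeps fewer variables but leaves at least as many complete cases.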
********** SAS-code Program **********

/* Count nonmissing values (N) for every numeric variable in dataset IN */
PROC MEANS data=IN N noprint;
   var _NUMERIC_;
   output out=COVERAGE (drop=_TYPE_ _FREQ_) N=;
run;

/* Build macro variable DN: names of the numeric variables with N < 1000. */
/* Macro variable AS holds the number of numeric variables.               */
DATA _NULL_;
   set COVERAGE end=last;
   array all_nums[*] _NUMERIC_;
   /* make sure the length is long enough */
   length DN $ 300;
   do i = 1 to dim(all_nums);
      if all_nums[i] < 1000 then DN = trim(DN) || ' ' || vname(all_nums[i]);
   end;
   call symput('DN', DN);
   call symput('AS', put(dim(all_nums), 8. -L));
   if last then do;
      LN = length(DN);
      put 'length ' LN;
   end;
RUN;

%put &DN;
%put &AS;

/* Dataset KEEP_VARS includes all character variables in dataset IN,
   irrespective of sample size */
DATA KEEP_VARS;
   set IN;
   drop &DN;
RUN;

PROC MEANS data=KEEP_VARS n nmiss;
   title2 ' KEEP-VARIABLES N ge 1000';
RUN;

/* Dataset DROP_VARS holds only the numeric variables with N < 1000 */
DATA DROP_VARS;
   set IN;
   keep &DN;
RUN;

PROC MEANS data=DROP_VARS n nmiss;
   title2 ' DROP-VARIABLES var N lt 1000';
RUN;
**********
1 800 DM STAT1, or email at br@dmstat1.com.
