DM Stat-1 Articles
Genetic Data Mining Method for the
Proper Use of the Correlation Coefficient
Bruce Ratner, Ph.D.

Assessing the relationship between a predictor variable and a target variable is an essential task in the model-building process.
If the relationship is identified and tractable, the predictor variable is re-expressed to reflect the uncovered relationship
and then tested for inclusion in the model. Most methods of variable assessment are based on the well-known
correlation coefficient, which is often misused because its linearity assumption is not tested. The purpose of this article is to
illustrate a genetic data mining method, the GenIQ Model©, that is perhaps the best “data-straightener” available today.
I use the third pair of x and y values from the well-known Anscombe data.


OUTLINE

I. Anscombe Data

ID     x        y

 1    10     7.46
 2     8     6.77
 3    13    12.74
 4     9     7.11
 5    11     7.81
 6    14     8.84
 7     6     6.08
 8     4     5.39
 9    12     8.15
10     7     6.42
11     5     5.73


II. GenIQ Model (Tree Display)

[Figure: GenIQ Model tree display]

The GenIQ Model (Code)

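/* Net effect: GenIQvar(y) = x / .6550772 + Cos(2x) */
x1 = .6550772;
x2 = x;
If x1 NE 0 Then x1 = x2 / x1; Else x1 = 1;   /* x1 = x / .6550772 */
x2 = x;
x3 = x;
x2 = x2 + x3;                                /* x2 = 2x */
x2 = Cos(x2);                                /* x2 = Cos(2x) */
x1 = x1 + x2;
GenIQvar(y) = x1;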
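
For readers who want to check the listing by hand, the following is a minimal Python sketch of the same steps. It assumes that Cos takes its argument in radians, as the SAS COS function does. The names geniq_var and X are labels chosen for this illustration and are not part of the GenIQ output.

```python
import math

# Anscombe x values (third pair), in ID order 1..11, from the data in Section I.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]


def geniq_var(x: float) -> float:
    """Step-by-step transcription of the GenIQ Model code listing.

    Assumption: Cos works in radians.
    """
    x1 = 0.6550772
    x2 = x
    # Guarded division, as in the listing (x1 is a nonzero constant here).
    x1 = x2 / x1 if x1 != 0 else 1.0
    x2 = x
    x3 = x
    x2 = x2 + x3        # 2x
    x2 = math.cos(x2)   # cos(2x)
    x1 = x1 + x2        # x / 0.6550772 + cos(2x)
    return x1


# Scores keyed by ID, rounded to four decimals like the published table.
scores = {i: round(geniq_var(x), 4) for i, x in enumerate(X, start=1)}
```

Under the radians assumption, the score for x = 13 comes out at about 20.49 and the score for x = 4 at about 5.96, in line with the GenIQvar(y) column of Table 2.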


III. GenIQ Model Results
Table 2 shows the results of the GenIQ Model as a data-straightener. The table is sorted by descending GenIQ Model score,
GenIQvar(y), and the rank-order prediction is perfect: the y values fall in strictly descending order as well.


Table 2. GenIQ Model Rank-order Prediction

ID     x        y    GenIQvar(y)

 3    13    12.74      20.4919
 6    14     8.84      20.4089
 9    12     8.15      18.7426
 5    11     7.81      15.7920
 1    10     7.46      15.6735
 4     9     7.11      14.3992
 2     8     6.77      11.2546
10     7     6.42      10.8225
 7     6     6.08      10.0031
11     5     5.73       6.7936
 8     4     5.39       5.9607
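
The rank-order claim can be checked mechanically. The sketch below is one possible check, not part of the GenIQ software: it sorts the rows of Table 2 by a chosen column in descending order and tests whether y then falls in strictly descending order. The names TABLE2 and is_rank_order_perfect are made up for this illustration.

```python
# Rows of Table 2 as (ID, x, y, GenIQvar(y)).
TABLE2 = [
    (3, 13, 12.74, 20.4919),
    (6, 14, 8.84, 20.4089),
    (9, 12, 8.15, 18.7426),
    (5, 11, 7.81, 15.7920),
    (1, 10, 7.46, 15.6735),
    (4, 9, 7.11, 14.3992),
    (2, 8, 6.77, 11.2546),
    (10, 7, 6.42, 10.8225),
    (7, 6, 6.08, 10.0031),
    (11, 5, 5.73, 6.7936),
    (8, 4, 5.39, 5.9607),
]


def is_rank_order_perfect(rows, key_col):
    """True if sorting by rows[key_col] (descending) also puts y in strictly descending order."""
    ranked = sorted(rows, key=lambda row: row[key_col], reverse=True)
    ys = [row[2] for row in ranked]
    return all(a > b for a, b in zip(ys, ys[1:]))


by_score = is_rank_order_perfect(TABLE2, key_col=3)  # rank by GenIQvar(y)
by_x = is_rank_order_perfect(TABLE2, key_col=1)      # rank by raw x
```

Ranking by GenIQvar(y) passes the check, while ranking by the raw x does not: x = 14 would be placed ahead of x = 13, although ID 3 (x = 13) has the largest y.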


Perhaps the best way to illustrate the GenIQ Model as both a data-straightener and a data mining tool is with the two plots below:
Plot y*x and Plot GenIQ*x.
[Figure: Plot y*x and Plot GenIQ*x]


IV. Summary 
Is the GenIQ Model perhaps an excellent data-straightener and data mining tool, all in one? What do you think?
Oh, two more things: the correlation coefficients between y and x, and between GenIQvar(y) and x, are 0.81629 and 0.9895, respectively.
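
For anyone who wants to reproduce the two coefficients, here is a minimal sketch that computes the ordinary Pearson correlation coefficient from its definition, using the published values in ID order. The helper pearson_r and the list names are illustrative choices only.

```python
import math

# Values in ID order 1..11: x and y from Section I, GenIQvar(y) from Table 2.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
G = [15.6735, 11.2546, 20.4919, 14.3992, 15.7920, 20.4089,
     10.0031, 5.9607, 18.7426, 10.8225, 6.7936]


def pearson_r(a, b):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


r_yx = pearson_r(Y, X)  # y versus x
r_gx = pearson_r(G, X)  # GenIQvar(y) versus x
```

The two results should land close to the 0.81629 and 0.9895 quoted above, with any small difference coming from the four-decimal rounding of the published scores.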


For more information about this article, call Bruce Ratner at 516.791.3544, 1 800 DM STAT-1, or e-mail at br@dmstat1.com.
