
A Phat Example of the GenIQ Model's Predictive Power
Bruce Ratner, Ph.D.
The purpose of this article is to exemplify or, more to the point, to swank the predictive power of the GenIQ Model©, an alternative technique for modeling a binary or continuous target variable. The GenIQ Model©, which is based on the assumption-free, nonparametric genetic paradigm inspired by Darwin’s Principle of Survival of the Fittest, offers theoretical and ease-of-use advantages over the standard logistic and ordinary least-squares regression models. It automatically and simultaneously “evolves” the model structure and selects variables from among the candidate predictor variables. The open-worked GenIQ Model and its wordbook are generally regarded as undemanding for newcomers to genetic modeling. A real case study using human age and fatness, let's call it the "Phat Example," is presented to encourage the use of the new method.
I use the machine learning GenIQ Model to build a classification model, which predicts the rank-order likelihood of being a male, to illustrate the advantages and to highlight the singular weakness of the machine learning paradigm. Specifically, the GenIQ Model shows the superiority of the machine learning paradigm over the statistical paradigm: it not only specifies the close-to-true model form (a computer program), but simultaneously performs variable selection (which in this example is trivial because only two predictor variables are considered), performs data mining, and builds the model. It’s like a Genetic Jackknife 3-in-1 Method. The weakness is the difficulty of interpreting the computer program, which often accounts for the limited use of the machine learning paradigm.
I. Situation
The data come from a
study investigating a new method of measuring body composition, and
give the body fat percentage (PERCENT_FAT), AGE, and gender (if
male then MALE=1, if female then MALE=0) for eighteen normal
adults aged between 23 and 61 years. How are AGE and PERCENT_FAT
related, and is there any evidence that the relationship is different
for males and females? Effectively, if a model can be built that distinguishes
between males and females, then the model itself is the evidence.
The “Phat Example” data are in Table 1, below (from American Journal of
Clinical Nutrition, 40, 834-839).
Table 1. The “Phat Example” Data
I built the easy-to-interpret logistic regression model (LRM) and the not-so-easy-to-interpret GenIQ Model for the target variable MALE. This creates a counterpoint in which the data analyst can choose between a good, interpretable model and a potentially better but unexplainable model.
II. LRM Output
The LRM output (Analysis of Maximum Likelihood Estimates) gives what is arguably the best Phat-LRM equation (model):
Log of odds of MALE(=1) = 11.0912 + 0.00940*AGE - 4.9393*PERCENT_FAT
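As a minimal sketch, the Phat-LRM equation can be evaluated in Python as shown here. The coefficients are the ones quoted above, with the PERCENT_FAT term read as a subtraction. The function names and the input values in the usage lines are hypothetical, and PERCENT_FAT is assumed to be on whatever scale Table 1 uses.

```python
import math

# Coefficients quoted in Section II. The PERCENT_FAT coefficient is taken
# as negative, which is an assumption about the original equation.
INTERCEPT = 11.0912
B_AGE = 0.00940
B_PERCENT_FAT = -4.9393


def log_odds_male(age, percent_fat):
    """Log of odds of MALE(=1) under the Phat-LRM equation."""
    return INTERCEPT + B_AGE * age + B_PERCENT_FAT * percent_fat


def prob_male(age, percent_fat):
    """Convert the log of odds to a probability with the logistic function."""
    z = log_odds_male(age, percent_fat)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical inputs, not rows from Table 1.
    print(log_odds_male(30, 2.0))
    print(prob_male(30, 2.0))
```

Because the logistic function is monotone, ranking adults by the log of odds or by the probability gives the same rank order.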
III. Phat-LRM Results
The results of the Phat-LRM are in Table 2, the rank-order prediction of MALE based on log_of_odds_of_MALE, below. The rank-order prediction of MALE is not perfect: adult ID #7 is in the sixth rank rather than the fourth rank, which would have made the Phat-LRM results perfect.
Table 2. Rank-order Prediction of MALE based on log_of_odds_of_MALE
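One way to read "perfect rank-order prediction" is that, once the adults are sorted by model score, every male sits above every female. The sketch below checks that condition. The function name and the toy scores are hypothetical and are not taken from Table 2.

```python
def is_perfect_rank_order(scores, is_male):
    """Return True when every male scores strictly above every female.

    scores  : model scores, one per adult (higher means more likely male)
    is_male : 1 for male, 0 for female, in the same order as scores
    """
    male_scores = [s for s, m in zip(scores, is_male) if m == 1]
    female_scores = [s for s, m in zip(scores, is_male) if m == 0]
    if not male_scores or not female_scores:
        return True  # nothing to separate
    return min(male_scores) > max(female_scores)


if __name__ == "__main__":
    # Hypothetical scores: the last male falls below two females.
    scores = [4.2, 3.1, 2.5, 1.9, 1.4, 0.8, -0.5]
    is_male = [1, 1, 1, 0, 0, 1, 0]
    print(is_perfect_rank_order(scores, is_male))
```

Ties among males, or among females, do not affect this check; only a male scoring at or below some female breaks it.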
IV. GenIQ Model Output
The Phat-GenIQ Model Tree Display and its Form (Computer Program) are below.
The GenIQ Model (Tree Display)
The GenIQ Model (Computer Program)
x1 = PERCENT_FAT;
x2 = AGE;
x2 = Sin(x2);
x1 = x2 - x1;
GenIQvar = x1;
V. GenIQ Variable Selection
GenIQ variable selection provides a rank-ordering of variable importance for a
predictor variable with respect to the other predictor variables considered
jointly. This is in stark contrast to the well-known, always-used
statistical correlation coefficient, which only provides a simple
correlation between a predictor variable and the target variable,
independent of the other predictor variables under consideration.
Because this study has only two predictor variables, the rank-ordering
of variable importance is trivial.
Variable Importance (with respect to other variables considered jointly)
1. PERCENT_FAT
2. AGE
VI. GenIQ Data Mining
GenIQ data mining is directly apparent from the GenIQ tree itself. Because this
study has only two predictor variables, there are no signature GenIQ branches
(genetically data-mined structures, i.e., new variables, the "golden nuggets"
desired from a data mining effort). There is only a sine transformation of AGE,
sin(AGE), denoted by sine_of_AGE, which is nonetheless representative of data
mining, albeit in its simplest form.
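The GenIQ computer program from Section IV, including its data-mined sine_of_AGE variable, can be mirrored in Python as a minimal sketch. It assumes that the final assignment is the subtraction x1 = x2 - x1 and that Sin takes its argument in radians. Both are assumptions. The function names and the input values in the usage lines are hypothetical.

```python
import math


def sine_of_age(age):
    """The data-mined variable sine_of_AGE; radians are assumed."""
    return math.sin(age)


def geniq_var(age, percent_fat):
    """Step-by-step mirror of the GenIQ Model (Computer Program)."""
    x1 = percent_fat
    x2 = age
    x2 = math.sin(x2)   # x2 = Sin(x2)
    x1 = x2 - x1        # assumed subtraction: sine_of_AGE minus PERCENT_FAT
    return x1           # GenIQvar


if __name__ == "__main__":
    # Hypothetical inputs, not rows from Table 1.
    print(sine_of_age(40))
    print(geniq_var(40, 2.5))
```

Written as a single expression, the model score is GenIQvar = sin(AGE) - PERCENT_FAT, which is why the first model counts as compact.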
To appreciate the predictive power of the GenIQ Model, it is enlightening to see the single relationship of each predictor variable with the target variable in Tables 3, 4 and 5, which show the rank-order predictions of MALE based on AGE, on sine_of_AGE, and on PERCENT_FAT, respectively. Then imagine the brilliance of the built-in IQ of GenIQ in how it uncovers and ties together the individual data-mined relationships into its final model output in Section IV (GenIQ Model Tree Display and Computer Program) above, and in the GenIQ Model results in Table 6 below.
Table 3. Rank-order Prediction of MALE based on AGE
Table 4. Rank-order Prediction of MALE based on sine_of_AGE
Table 5. Rank-order Prediction of MALE based on PERCENT_FAT
VII. Phat-GenIQ Model Results
The results of the Phat-GenIQ Model are in Table 6, the GenIQ Model GenIQvar rank-order prediction of MALE, below. There is a perfect rank-order prediction of MALE.
Table 6. GenIQ Model GenIQvar Rank-order Prediction of MALE
VIII. Phat-GenIQ Model Version #2 Output and Results
GenIQ modeling is like all other (non-physical science) modeling: there is no unique model, but there are comparable, if not exact, results from alternative methods or from different versions of the modeling process. To that end, I built a Phat-GenIQ Model Version #2. Its Tree Display and Computer Program (which includes Int, the integer function that takes the integer part of the number at hand) and its corresponding Table 7, the GenIQ Model Version #2 GenIQvar2 rank-order prediction of MALE, are below. GenIQ Model Version #2 produces a perfect rank-order prediction of MALE. However, I prefer the first Phat-GenIQ Model over the Version #2 model because the first model is compact (a desirable property of any model) and has more precise model scores (obviously a desirable property of any model) than the second model. The first model is compact, albeit at the expense of the unexpected appearance of the sine function.
Also, its model scores for the top two adults, IDs #3 and #4, have precisely distinguishing GenIQvar values, 0.25638 and 0.74362, respectively. The Phat-GenIQ Model Version #2 is definitely not easy on the eyes (not compact), although it uses the easy-to-understand Integer function. But it is not as precise as the first model, as it assigns the same GenIQvar2 value of 0.00000 to its top two adults, IDs #3 and #1.
The less precise Phat-GenIQ Model Version #2 raises the question of whether it is also less precise, or less discriminating, than the first Phat-GenIQ Model among the females (MALE=0). This can be addressed with the Coefficient of Variation (CV). (Recall, the CV is the standard deviation divided by the mean, a dimensionless number that allows comparison of the variation of populations with different positive mean values. It is often reported as a percentage by multiplying that ratio by 100. The smaller the CV, the less variation among the population/sample values.) I use the CV to see whether the variation, as an indicator of the spread or diversity of the model scores, is less for the second model than for the first. I disregard the negative sign of the model scores to have positive mean values. The CVs are 22.97 and 23.08 for the GenIQvar2 and GenIQvar scores, respectively. Thus, Phat-GenIQ Model Version #2 is not as precise as the Phat-GenIQ Model in distinguishing among the adult females.
As a counterpoint to analysis and modeling tasks in the non-physical sciences, consider the world's most famous equation:
E = mc²
It is unique, precise, and beautifully compact.
The GenIQ Model Version #2 (Tree Display)
The GenIQ Model Version #2 (Computer Program)
x1 = PERCENT_FAT;
x1 = Int(x1);
x2 = AGE;
x3 = PERCENT_FAT;
x3 = Int(x3);
x4 = .1407257;
x5 = AGE;
x4 = x4 * x5;
x4 = Int(x4);
x3 = x3 * x4;
If x2 NE 0 Then x2 = x3 / x2;
Else x2 = 1;
x1 = x2 - x1;
GenIQvar = x1;
Table 7. GenIQ Model Version #2 GenIQvar2 Rank-order Prediction of MALE
IX. Summary
The machine learning paradigm (MLP), “let the data suggest the model,” is a practical alternative to the statistical paradigm, “fit the data to the equation,” which has its roots in a time when data were only “small.” It was, and still is, reasonable to fit small data in a rigid, parametric, assumption-filled model. However, the current information (big data) in, say, cyberspace requires a paradigm shift. MLP is a utile approach for database modeling when dealing with big data, as big data can be difficult to fit in a specified model. Thus, MLP can function alongside the regnant statistical approach when the data, big or small, simply do not “fit.” As demonstrated with the “Phat Example” data, MLP works well within small data settings.