
A Phat Example of the GenIQ Model's Predictive Power
Bruce Ratner, Ph.D.

The purpose of this article is to exemplify, or more to the point, to show off the predictive power of the GenIQ Model©, an alternative technique for modeling a binary or continuous target variable. The GenIQ Model©, which is based on the assumption-free, nonparametric genetic paradigm inspired by Darwin’s Principle of Survival of the Fittest, offers theoretical and ease-of-use advantages over the standard logistic and ordinary least-squares regression models. It automatically and simultaneously “evolves” the model structure and selects among the candidate predictor variables. The open-worked GenIQ Model and its wordbook are both generally regarded as undemanding for newcomers to genetic modeling. A real case study using human age and fatness, let's call it the "Phat Example," illustrates the new method and is meant to encourage its use.

I use the machine learning GenIQ Model to build a classification model, which predicts the rank-order likelihood of being a male, to illustrate the advantages and to highlight the singular weakness of the machine learning paradigm. Specifically, the GenIQ Model shows the superiority of the machine learning paradigm over the statistical paradigm: it not only specifies the true model form (a computer program), but simultaneously performs variable selection (which in this example is trivial because only two predictor variables are considered), mines the data, and builds the model. It is like a Genetic Jackknife 3-in-1 Method. The weakness is interpretation: the difficulty in interpreting the computer program often accounts for the limited use of the machine learning paradigm.

Outline of Article

I. Situation

The data come from a study investigating a new method of measuring body composition, and give the body fat percentage (PERCENT_FAT), AGE, and gender (MALE=1 if male, MALE=0 if female) for eighteen normal adults aged between 23 and 61 years.
How are AGE and PERCENT_FAT related, and is there any evidence that the relationship is different for males and females? Effectively, if a model that can distinguish between males and females can be built, then the model is the evidence. The “Phat Example” data are in Table 1, below (from American Journal of Clinical Nutrition, 40, 834-839).

Table 1. The “Phat Example” Data

I built the easy-to-interpret logistic regression model (LRM) and the not-so-easy-to-interpret GenIQ Model for the target variable MALE. This creates a counterpoint in which the data analyst can choose between a good, interpretable model and a potentially better, unexplainable model.
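An LRM scores each adult by the log of odds of MALE, a linear function of AGE and PERCENT_FAT. The following is a minimal sketch of how such a score converts to a probability, and why ranking by log of odds and ranking by probability give the same order. The coefficients and the three records are placeholders invented for illustration; they are not the fitted values or the Table 1 data.

```python
import math

def log_odds(age, percent_fat, intercept, b_age, b_fat):
    """Linear predictor of a two-variable logistic regression model."""
    return intercept + b_age * age + b_fat * percent_fat

def probability(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Placeholder coefficients (assumption, for illustration only).
COEF = {"intercept": 1.0, "b_age": 0.01, "b_fat": -0.1}

# Hypothetical (id, AGE, PERCENT_FAT) records, not the study data.
adults = [("A", 30, 12.0), ("B", 30, 28.0), ("C", 50, 20.0)]

scored = [(rid, log_odds(age, fat, **COEF)) for rid, age, fat in adults]

# Rank from most to least likely male, first by log of odds, then by probability.
by_log_odds = [rid for rid, z in sorted(scored, key=lambda p: p[1], reverse=True)]
by_probability = [
    rid for rid, z in sorted(scored, key=lambda p: probability(p[1]), reverse=True)
]
```

Because the logistic function is strictly increasing, the two orderings always agree, which is why a rank-order table can be built directly on the log of odds.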
II. LRM Output

The LRM output (Analysis of Maximum Likelihood Estimates) gives what is arguably the best Phat-LRM equation (model):

Log of odds of MALE(=1) = 11.0912 + 0.00940*AGE - 4.9393*PERCENT_FAT

III. Phat-LRM Results

The results of the Phat-LRM are in Table 2 (LRM log_of_odds_of_MALE Rank-order Prediction of MALE), below. The rank-order prediction of MALE is not perfect: adult ID #7 is in the sixth rank, not the fourth rank, where he would need to be for the Phat-LRM results to be perfect.

Table 2. Rank-order Prediction of MALE based on log_of_odds_of_MALE

IV. GenIQ Model Output

The Phat-GenIQ Model Tree Display and its Form (Computer Program) are below.

The GenIQ Model (Tree Display)

The GenIQ Model (Computer Program)

x1 = PERCENT_FAT;
x2 = AGE;
x2 = Sin(x2);
x1 = x2 - x1;
GenIQvar = x1;

V. GenIQ Variable Selection

GenIQ variable selection provides a rank-ordering of variable importance for a predictor variable with respect to the other predictor variables considered jointly. This is in stark contrast to the well-known, always-used statistical correlation coefficient, which only provides a simple correlation between a predictor variable and the target variable, independent of the other predictor variables under consideration. Because this study has only two predictor variables, the rank-ordering of variable importance is trivial.

Variable Importance (w/r/to other variables considered jointly)
1. PERCENT_FAT
2. AGE

VI. GenIQ Data Mining
GenIQ data mining is directly apparent from the GenIQ tree itself. Because this study has only two predictor variables, there are no signature GenIQ branches (genetically data-mined structures, i.e., new variables, the "golden nuggets" desired from a data mining effort). There is only a sine transformation of AGE, sin(AGE), denoted by sine_of_AGE, which is nonetheless representative of data mining, albeit in its simplest form.
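The GenIQ computer program in Section IV can be read as a short scoring function. The sketch below mirrors it step by step under two assumptions that the article does not state: that Sin takes AGE in radians, and that the operator between x2 and x1 (lost in this copy of the text) is subtraction. The function names and the three records are hypothetical and are not the Table 1 data.

```python
import math

def geniq_var(age, percent_fat):
    """Step-by-step mirror of the Section IV program."""
    x1 = percent_fat
    x2 = age
    x2 = math.sin(x2)  # assumption: AGE is passed to sine in radians
    x1 = x2 - x1       # assumption: the missing operator is subtraction
    return x1

def rank_order(records):
    """Score (id, AGE, PERCENT_FAT, MALE) records and sort by descending score."""
    scored = [(rid, male, geniq_var(age, fat)) for rid, age, fat, male in records]
    return sorted(scored, key=lambda r: r[2], reverse=True)

def is_perfect_rank_order(ranked):
    """True when every male scores strictly higher than every female."""
    male_scores = [s for _, m, s in ranked if m == 1]
    female_scores = [s for _, m, s in ranked if m == 0]
    if not male_scores or not female_scores:
        raise ValueError("need at least one male and one female")
    return min(male_scores) > max(female_scores)

# Hypothetical records for illustration only.
demo = [("A", 30, 12.0, 1), ("B", 30, 28.0, 0), ("C", 50, 20.0, 0)]
ranked = rank_order(demo)
perfect = is_perfect_rank_order(ranked)
```

Note that the check tolerates ties within a group: two males with the same score still yield a perfect rank-order as long as both outrank every female.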
To appreciate the predictive power of the GenIQ Model, it is enlightening to see the single relationship of each predictor variable with the target variable in Tables 3, 4 and 5, which show the Rank-order Predictions of MALE based on AGE, on sine_of_AGE, and on PERCENT_FAT, respectively. Then, imagine the brilliance of the built-in IQ of GenIQ in how it uncovers and ties together the individual data-mined relationships into its final model output in Section IV (GenIQ Model Tree Display and Computer Program) above, and in the GenIQ Model Results in Table 6 below.

Table 3. Rank-order Prediction of MALE based on AGE

Table 4. Rank-order Prediction of MALE based on sine_of_AGE

Table 5. Rank-order Predictions of MALE based on PERCENT_FAT

VII. Phat-GenIQ Model Results

The results of the Phat-GenIQ Model are in Table 6 (GenIQ Model GenIQvar Rank-order Prediction of MALE), below. There is a perfect rank-order prediction of MALE.

Table 6. GenIQ Model GenIQvar Rank-order Prediction of MALE

VIII. Phat-GenIQ Model Version #2 Output and Results

GenIQ modeling is like all other (nonphysical science) modeling: there is no unique model, but there are comparable, if not exact, results from alternative methods or different versions of the modeling process. To that end, I built a Phat-GenIQ Model Version #2. The Phat-GenIQ Model Version #2 Tree Display and Computer Program (which includes Int, the Integer function that takes the integer part of the number at hand), and the corresponding Table 7 (GenIQ Model Version #2 GenIQvar2 Rank-order Prediction of MALE), are below. GenIQ Model Version #2 produces a perfect rank-order prediction of MALE. However, I prefer the first Phat-GenIQ Model over the Version #2 model because the first model is compact (a desirable property of any model) and produces more precise model scores (obviously a desirable property of any model) than the second model. The first model is compact, albeit at the expense of the unexpected appearance of the sine function.
Also, its model scores for the top two adults, IDs #3 and #4, have precisely distinguishing GenIQvar score values, 0.25638 and 0.74362, respectively. The Phat-GenIQ Model Version #2 is definitely not easy on the eyes (not compact), although it uses the easy-to-understand Integer function. But it is not as precise as the first model, as it assigns the same GenIQvar2 score value of 0.00000 to the top two adults, IDs #3 and #1.

The less precise Phat-GenIQ Model Version #2 prompts the question of whether the model is also less precise or discriminating than the first Phat-GenIQ Model among the females (MALE=0). This can be addressed by the Coefficient of Variation (CV). (Recall that the CV, the ratio of the standard deviation to the mean, is a dimensionless number that allows comparison of the variation of populations with different positive mean values. It is often reported as a percentage by multiplying the ratio by 100. The smaller the CV, the less variation among the population/sample values.) I use the CV to see if the variation, as an indicator of spread or diversity of model scores, is less for the second model than it is for the first model. I disregard the negative sign of the model scores to have positive mean values. The CVs are 22.97 and 23.08 for the GenIQvar2 and GenIQvar scores, respectively. Because the second model's CV is the smaller of the two, Phat-GenIQ Model Version #2 is not as precise as the first Phat-GenIQ Model at distinguishing among the adult females.

As a counterpoint to analysis and modeling tasks in the nonphysical sciences, consider the world's most famous equation: E = mc^2. It is unique, precise, and beautifully compact.

The GenIQ Model Version #2 (Tree Display)

The GenIQ Model Version #2 (Computer Program)

x1 = PERCENT_FAT;
x1 = Int(x1);
x2 = AGE;
x3 = PERCENT_FAT;
x3 = Int(x3);
x4 = .1407257;
x5 = AGE;
x4 = x4 * x5;
x4 = Int(x4);
x3 = x3 * x4;
If x2 NE 0 Then x2 = x3 / x2; Else x2 = 1;
x1 = x2 - x1;
GenIQvar = x1;

Table 7.
GenIQ Model Version #2 GenIQvar2 Rank-order Prediction of MALE

IX. Summary

The machine learning paradigm (MLP), “let the data suggest the model,” is a practical alternative to the statistical paradigm, “fit the data to the equation,” which has its roots in a time when data were only “small.” It was, and still is, reasonable to fit small data to a rigid, parametric, assumption-filled model. However, the current information (big data) in, say, cyberspace requires a paradigm shift. MLP is a utile approach for database modeling when dealing with big data, as big data can be difficult to fit to a specified model. Thus, MLP can function alongside the regnant statistical approach when the data, big or small, simply do not “fit.” As demonstrated with the “Phat Example” data, MLP also works well within small data settings.
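As a supplement to the CV comparison in Section VIII, the calculation can be sketched as follows. The sketch assumes the sample standard deviation (n - 1 denominator), since the article does not say which form was used, and it takes absolute values to mirror the step of disregarding the negative sign of the scores. The function name and the demo scores are hypothetical; the sketch does not reproduce the reported values of 22.97 and 23.08.

```python
import statistics

def coefficient_of_variation(scores, as_percent=True):
    """CV of the score magnitudes: standard deviation divided by mean."""
    magnitudes = [abs(s) for s in scores]  # disregard the sign of each score
    mean = statistics.mean(magnitudes)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: sample standard deviation (n - 1 denominator).
    cv = statistics.stdev(magnitudes) / mean
    return cv * 100 if as_percent else cv

# Hypothetical model scores for two competing models, for illustration only.
scores_model_a = [-2.0, -4.0, -6.0]
scores_model_b = [-3.0, -4.0, -5.0]

cv_a = coefficient_of_variation(scores_model_a)
cv_b = coefficient_of_variation(scores_model_b)

# The model with the smaller CV spreads its scores less relative to their mean.
less_spread = "b" if cv_b < cv_a else "a"
```

With these made-up scores, the second list has the smaller CV, so by the reasoning in Section VIII it would be the less discriminating of the two.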