Statistics versus Machine Learning:

A Significant Difference for Database Response Modeling

Bruce Ratner, Ph.D.

The regnant statistical paradigm for database response modeling is: the data analyst fits the data to the presumably true logistic regression model (LRM), whose form (equation) is that the log of the odds of response is a sum of weighted predictor variables. The predictor variables are determined by a mixture of well-established variable selection methods and the will of the data analyst to re-express the original variables and construct new variables (data mining). The weights, better known as the regression coefficients, are determined by the pre-programmed machine-crunching method of calculus. The purpose of this article is to show a significant difference for database response modeling when implementing the antithetical machine learning paradigm: the data suggest the “true” model form, as the machine learning process acquires knowledge of the form without being explicitly programmed.

Outline of Article

I. Situation

When my daughter Amanda was in grade school, she could not understand the decision-making process of her principal, Dr. Katz. On some rainy days, Dr. Katz would permit the class to go outside for recess to play. On other days, when it was sunny, Dr. Katz would say, “no play.” As a statistician’s daughter, Amanda collected some weather information and asked me to build a model to predict what Dr. Katz would do in the days to come. Amanda created a “Let’s Play” database, in Table 1 (also in Quinlan’s C4.5, page 18!), which included the weather conditions for two weeks:

1.      Outlook (sunny, rainy, overcast)

2.      Temperature

3.      Humidity

4.      Windy (yes, no), and of course

5.      Play (yes, no).

I built the easy-to-interpret LRM and the not-so-easy-to-interpret GenIQ Model for the target variable Play (yes). This creates a counterpoint: the data analyst can now choose between a good interpretable model and a potentially better, unexplainable model.

Table 1. Let’s Play Dataset

Day  Outlook   Temperature  Humidity  Windy  Play
  1  sunny              85        85  no     no
  2  sunny              80        90  yes    no
  3  overcast           83        86  no     yes
  4  rainy              70        96  no     yes
  5  rainy              68        80  no     yes
  6  rainy              65        70  yes    no
  7  overcast           64        65  yes    yes
  8  sunny              72        95  no     no
  9  sunny              69        70  no     yes
 10  rainy              75        80  no     yes
 11  sunny              75        70  yes    yes
 12  overcast           72        90  yes    yes
 13  overcast           81        75  no     yes
 14  rainy              71        91  yes    no

II. LRM Output

The LRM output (Analysis of Maximum Likelihood Estimates) and arguably the best Play-LRM equation are below.

Analysis of Maximum Likelihood Estimates

Parameter         DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept          1     11.7403            7.0076             2.8069        0.0939
Outlook(sunny)     1     -2.2682            1.5631             2.1057        0.1468
Humidity           1     -0.1124            0.0768             2.1423        0.1433
Windy(yes)         1     -2.0470            1.5612             1.7192        0.1898

(Log of odds of) Play (yes) =

11.7403 - 2.2682*Outlook(sunny) - 0.1124*Humidity - 2.0470*Windy(yes)
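To make the equation concrete, here is a minimal Python sketch that scores a day with the fitted coefficients above (the function and argument names are my own, not from the LRM output). Because the coefficients are rounded to four decimals, the scores agree with Table 2 only to about two decimal places.

```python
def lrm_log_odds(outlook, humidity, windy):
    """Log of odds of Play (yes), from the fitted LRM coefficients."""
    sunny = 1 if outlook == "sunny" else 0      # Outlook(sunny) dummy
    windy_yes = 1 if windy == "yes" else 0      # Windy(yes) dummy
    return 11.7403 - 2.2682 * sunny - 0.1124 * humidity - 2.0470 * windy_yes

# Day 13: overcast, humidity 75, not windy -> approximately 3.31 (Table 2: 3.313367)
print(round(lrm_log_odds("overcast", 75, "no"), 4))
```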

III. Play-LRM Results

The results of the Play-LRM are in Table 2. The rank-order prediction of Play is not perfect: days 6, 1, 12, and 11 are out of order.

Table 2. LRM Rank-order Prediction of Play

Day  Outlook   Temperature  Humidity  Windy  Play  Log of Odds of Play
 13  overcast           81        75  no     yes    3.313367
  5  rainy              68        80  no     yes    2.751571
 10  rainy              75        80  no     yes    2.751571
  7  overcast           64        65  yes    yes    2.389934
  3  overcast           83        86  no     yes    2.077415
  6  rainy              65        70  yes    no     1.828138
  9  sunny              69        70  no     yes    1.607004
  4  rainy              70        96  no     yes    0.953822
  1  sunny              85        85  no     no    -0.07839
 12  overcast           72        90  yes    yes   -0.41905
 11  sunny              75        70  yes    yes   -0.44002
 14  rainy              71        91  yes    no    -0.53141
  8  sunny              72        95  no     no    -1.20198
  2  sunny              80        90  yes    no    -2.68721

IV. GenIQ Model Output

The form (computer program) of the Play-GenIQ Model, equivalent to its tree display, is below.

If outlook = "overcast" Then x1 = 1; Else x1 = 0;   /* Outlook(overcast) dummy */
If windy = "no" Then x2 = 1; Else x2 = 0;           /* Windy(no) dummy */
If outlook = "rainy" Then x3 = 1; Else x3 = 0;      /* Outlook(rainy) dummy */
x2 = x2 * x3;                                       /* Windy(no) * Outlook(rainy) */
x1 = x1 + x2;                                       /* + Outlook(overcast) */
If outlook = "rainy" Then x2 = 1; Else x2 = 0;
x3 = humidity;
x2 = x2 + x3;                                       /* Humidity + Outlook(rainy) */
x3 = humidity;
x4 = temperature;
If x3 NE 0 Then x3 = x4 / x3; Else x3 = 1;          /* Temperature / Humidity, guarded */
If x2 NE 0 Then x2 = x3 / x2; Else x2 = 1;          /* divided by Humidity + Outlook(rainy), guarded */
x1 = x1 + x2;
GenIQvar = x1;
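Because the program is straight-line code, it ports directly to other languages. The following Python transcription (my own, mirroring the statements line for line) reproduces the GenIQvar scores in Table 3:

```python
def geniq_var(outlook, temperature, humidity, windy):
    """Line-for-line transcription of the GenIQ computer program."""
    x1 = 1 if outlook == "overcast" else 0
    x2 = 1 if windy == "no" else 0
    x3 = 1 if outlook == "rainy" else 0
    x2 = x2 * x3                      # Windy(no) * Outlook(rainy)
    x1 = x1 + x2                      # + Outlook(overcast)
    x2 = 1 if outlook == "rainy" else 0
    x3 = humidity
    x2 = x2 + x3                      # Humidity + Outlook(rainy)
    x3 = humidity
    x4 = temperature
    x3 = x4 / x3 if x3 != 0 else 1    # Temperature / Humidity, guarded
    x2 = x3 / x2 if x2 != 0 else 1    # divided by Humidity + Outlook(rainy), guarded
    x1 = x1 + x2
    return x1

# Day 7: overcast, 64, 65, windy -> 1.015148, the top score in Table 3
print(round(geniq_var("overcast", 64, 65, "yes"), 6))
```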

V. GenIQ Variable Selection

GenIQ variable selection provides a rank-ordering of variable importance for a predictor variable with respect to the other predictor variables considered jointly. This is in stark contrast to the well-known, always-used statistical correlation coefficient, which provides only a simple correlation between a predictor variable and the target variable, independent of the other predictor variables under consideration.

Variable Importance (w/r/to other variables considered jointly)

1.      Outlook (overcast)

2.      Outlook (rainy)

3.      Windy (no)

4.      Humidity

5.      Outlook (sunny)

6.      Windy (yes)

7.      Temperature

VI. GenIQ Data Mining

GenIQ data mining is directly apparent from the GenIQ tree itself: each branch is a newly constructed variable with the power to improve the rank-order predictions.

1.      Var1 = Temperature / Humidity

2.      Var2 = Humidity + Outlook (rainy)

3.      Var3 = Var1 / Var2

4.      Var4 = Outlook (rainy) * Windy (no)

5.      Var5 = Var4 + Outlook (overcast)

6.      GenIQ Model = Var3 + Var5
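As a check on this reading of the tree (the Var names above are the article's; the function name is mine), composing the branches reproduces the program's output. The sketch below omits the program's NE-0 guards, which is safe here since Humidity, and hence Var2, is never zero in the data.

```python
def geniq_from_branches(outlook, temperature, humidity, windy):
    """Compose GenIQvar from the six branch variables listed above."""
    overcast = 1 if outlook == "overcast" else 0
    rainy = 1 if outlook == "rainy" else 0
    windy_no = 1 if windy == "no" else 0
    var1 = temperature / humidity     # Var1 = Temperature / Humidity
    var2 = humidity + rainy           # Var2 = Humidity + Outlook(rainy)
    var3 = var1 / var2                # Var3 = Var1 / Var2
    var4 = rainy * windy_no           # Var4 = Outlook(rainy) * Windy(no)
    var5 = var4 + overcast            # Var5 = Var4 + Outlook(overcast)
    return var3 + var5                # GenIQ Model = Var3 + Var5

# Day 10: rainy, 75, 80, not windy -> 1.011574 (Table 3)
print(round(geniq_from_branches("rainy", 75, 80, "no"), 6))
```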

VII. Play-GenIQ Model Results

The results of the Play-GenIQ Model are in Table 3. There is a perfect rank-order prediction of Play.

Table 3. GenIQ Model Rank-order Prediction of Play

Day  Outlook   Temperature  Humidity  Windy  Play  GenIQvar
  7  overcast           64        65  yes    yes   1.015148
 13  overcast           81        75  no     yes   1.014400
 10  rainy              75        80  no     yes   1.011574
  3  overcast           83        86  no     yes   1.011222
  5  rainy              68        80  no     yes   1.010494
 12  overcast           72        90  yes    yes   1.008889
  4  rainy              70        96  no     yes   1.007517
 11  sunny              75        70  yes    yes   0.015306
  9  sunny              69        70  no     yes   0.014082
  6  rainy              65        70  yes    no    0.013078
  1  sunny              85        85  no     no    0.011765
  2  sunny              80        90  yes    no    0.009877
 14  rainy              71        91  yes    no    0.008481
  8  sunny              72        95  no     no    0.007978
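The claim of perfect rank ordering can be verified directly. The sketch below (my own condensed restatement of the Section IV program; it assumes Humidity is never zero, which holds for all 14 days) scores every day in Table 1 and checks that the lowest-scoring Play = yes day still outranks the highest-scoring Play = no day.

```python
# Table 1 rows as (outlook, temperature, humidity, windy, play), days 1-14 in order
DAYS = [
    ("sunny", 85, 85, "no", "no"), ("sunny", 80, 90, "yes", "no"),
    ("overcast", 83, 86, "no", "yes"), ("rainy", 70, 96, "no", "yes"),
    ("rainy", 68, 80, "no", "yes"), ("rainy", 65, 70, "yes", "no"),
    ("overcast", 64, 65, "yes", "yes"), ("sunny", 72, 95, "no", "no"),
    ("sunny", 69, 70, "no", "yes"), ("rainy", 75, 80, "no", "yes"),
    ("sunny", 75, 70, "yes", "yes"), ("overcast", 72, 90, "yes", "yes"),
    ("overcast", 81, 75, "no", "yes"), ("rainy", 71, 91, "yes", "no"),
]

def score(outlook, temperature, humidity, windy):
    """The GenIQ program of Section IV, condensed to one expression."""
    overcast = 1 if outlook == "overcast" else 0
    rainy = 1 if outlook == "rainy" else 0
    windy_no = 1 if windy == "no" else 0
    # Var5 term + Var3 term (no zero guards needed: humidity > 0 throughout)
    return overcast + rainy * windy_no + (temperature / humidity) / (humidity + rainy)

yes_scores = [score(o, t, h, w) for o, t, h, w, p in DAYS if p == "yes"]
no_scores = [score(o, t, h, w) for o, t, h, w, p in DAYS if p == "no"]
print(min(yes_scores) > max(no_scores))  # True: perfect rank-order separation
```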

VIII. Summary

The machine learning paradigm (MLP) “let the data suggest the model” is a practical alternative to the statistical paradigm “fit the data to the LRM equation,” which has its roots in the era when data were only “small.” It was, and still is, reasonable to fit small data to a rigid parametric, assumption-filled model. However, the current information (big data) in, say, cyberspace requires a paradigm shift. MLP is a utile approach for database response modeling when dealing with big data, as big data can be difficult to fit into a prespecified model. Thus, MLP can function alongside the regnant statistical approach when the data, big or small, simply do not “fit.” As demonstrated with the “Let’s Play” data, MLP also works well in small data settings.