Statistics versus Machine Learning:

A Significant Difference for Database Response Modeling

Bruce Ratner, Ph.D.


 

The regnant statistical paradigm for database response modeling is: The data analyst fits the data to the presumed true logistic regression model (LRM), whose form (equation) states that the log of the odds of response is the sum of weighted predictor variables. The predictor variables are determined by a mixture of well-established variable selection methods and the will of the data analyst to re-express the original variables and construct new variables (data mining). The weights, better known as the regression coefficients, are determined by the pre-programmed machine-crunching method of calculus. The purpose of this article is to show a significant difference for database response modeling when implementing the antithetical machine learning paradigm: The data suggest the “true” model form, as the machine learning process acquires knowledge of the form without being explicitly programmed.

 

I use the machine learning GenIQ Model© and LRM to build a database response model, which predicts the rank-order likelihood of response, to illustrate the advantages and the singular weakness of the machine learning paradigm. Specifically, the GenIQ Model shows the superiority of the machine learning paradigm over the statistical paradigm, as it not only specifies the true model form (a computer program), but simultaneously performs variable selection and data mining. The difficulty in interpreting the computer program often accounts for the limited use of the machine learning paradigm.

 

Outline of Article

I. Situation

When my daughter Amanda was in grade school, she could not understand the decision-making process of her principal, Dr. Katz. On some rainy days, Dr. Katz would permit the class to go outside for recess to play. On other days when it was sunny, Dr. Katz would say, “no play.”  As a statistician’s daughter, Amanda collected some weather information and asked me to build a model to predict what Dr. Katz would do in the days to come. Amanda created a “Let’s Play” database, shown in Table 1 (also in Quinlan’s C4.5, page 18!), which included the weather conditions for two weeks:

1.      Outlook (sunny, rainy, overcast)

2.      Temperature

3.      Humidity

4.      Windy (yes, no), and of course

5.      Play (yes, no).

 

I built the easy-to-interpret LRM and the not-so-easy-to-interpret GenIQ Model for the target variable Play (yes). This creates a counterpoint where the data analyst can now choose between a good, interpretable model and a potentially better, unexplainable model.

   

       

 

 

Table 1. Let’s Play Dataset

Day   Outlook    Temperature   Humidity   Windy   Play
1     sunny      85            85         no      no
2     sunny      80            90         yes     no
3     overcast   83            86         no      yes
4     rainy      70            96         no      yes
5     rainy      68            80         no      yes
6     rainy      65            70         yes     no
7     overcast   64            65         yes     yes
8     sunny      72            95         no      no
9     sunny      69            70         no      yes
10    rainy      75            80         no      yes
11    sunny      75            70         yes     yes
12    overcast   72            90         yes     yes
13    overcast   81            75         no      yes
14    rainy      71            91         yes     no

 

 

II. LRM Output

The LRM output (Analysis of Maximum Likelihood Estimates) and arguably the best Play-LRM equation are below.  

 

Analysis of Maximum Likelihood Estimates

 

                                   Standard   Wald
Parameter        DF   Estimate     Error      Chi-Square   Pr > ChiSq
Intercept        1    11.7403      7.0076     2.8069       0.0939
Outlook(sunny)   1    -2.2682      1.5631     2.1057       0.1468
Humidity         1    -0.1124      0.0768     2.1423       0.1433
Windy(yes)       1    -2.0470      1.5612     1.7192       0.1898

 

 

Log of odds of Play (yes) = 11.7403 - 2.2682*Outlook(sunny) - 0.1124*Humidity - 2.0470*Windy(yes)
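
The Play-LRM equation can be applied to a single day as in the following minimal Python sketch. The function and constant names are illustrative assumptions, not part of the original analysis, and the coefficients are the rounded values printed in the output above.

```python
import math

# Coefficients as printed in the Analysis of Maximum Likelihood Estimates
# (rounded to four decimals in that output).
INTERCEPT = 11.7403
B_SUNNY = -2.2682
B_HUMIDITY = -0.1124
B_WINDY = -2.0470


def play_log_odds(outlook: str, humidity: float, windy: str) -> float:
    """Log of the odds of Play (yes) under the Play-LRM equation."""
    sunny = 1 if outlook == "sunny" else 0        # Outlook(sunny) indicator
    windy_yes = 1 if windy == "yes" else 0        # Windy(yes) indicator
    return (INTERCEPT + B_SUNNY * sunny
            + B_HUMIDITY * humidity + B_WINDY * windy_yes)


def play_probability(log_odds: float) -> float:
    """Convert log odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Day 13 of Table 1: overcast, humidity 75, not windy.
score = play_log_odds("overcast", 75, "no")
print(round(score, 4), round(play_probability(score), 4))
```

For day 13 the log odds come to about 3.3103, close to the 3.313367 reported in Table 2; the small gap is presumably due to the rounding of the printed coefficients.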

 

 

 

 

III. Play-LRM Results

The results of the Play-LRM are in Table 2. The rank-order prediction of Play is not perfect: days 6 and 1 (Play = no) rank above days 12 and 11 (Play = yes).

 

Table 2. LRM Rank-order Prediction of Play

Day   Outlook    Temperature   Humidity   Windy   Play   Log of odds of Play
13    overcast   81            75         no      yes     3.313367
5     rainy      68            80         no      yes     2.751571
10    rainy      75            80         no      yes     2.751571
7     overcast   64            65         yes     yes     2.389934
3     overcast   83            86         no      yes     2.077415
6     rainy      65            70         yes     no      1.828138
9     sunny      69            70         no      yes     1.607004
4     rainy      70            96         no      yes     0.953822
1     sunny      85            85         no      no     -0.07839
12    overcast   72            90         yes     yes    -0.41905
11    sunny      75            70         yes     yes    -0.44002
14    rainy      71            91         yes     no     -0.53141
8     sunny      72            95         no      no     -1.20198
2     sunny      80            90         yes     no     -2.68721
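
One way to make "not a perfect rank-order prediction" concrete is to count the yes/no pairs of days in which a no-play day scores at least as high as a play day. The sketch below does this with the log odds copied from Table 2; the helper name `misordered_pairs` is an assumption introduced for illustration.

```python
# (day, play, log odds of Play) copied from Table 2.
table2 = [
    (13, "yes", 3.313367), (5, "yes", 2.751571), (10, "yes", 2.751571),
    (7, "yes", 2.389934), (3, "yes", 2.077415), (6, "no", 1.828138),
    (9, "yes", 1.607004), (4, "yes", 0.953822), (1, "no", -0.07839),
    (12, "yes", -0.41905), (11, "yes", -0.44002), (14, "no", -0.53141),
    (8, "no", -1.20198), (2, "no", -2.68721),
]


def misordered_pairs(rows):
    """Return (no_day, yes_day) pairs where the 'no' day scores >= the 'yes' day."""
    yes_rows = [(day, score) for day, play, score in rows if play == "yes"]
    no_rows = [(day, score) for day, play, score in rows if play == "no"]
    return [(no_day, yes_day)
            for no_day, no_score in no_rows
            for yes_day, yes_score in yes_rows
            if no_score >= yes_score]


bad = misordered_pairs(table2)
n_yes = sum(play == "yes" for _, play, _ in table2)
n_no = sum(play == "no" for _, play, _ in table2)
total = n_yes * n_no

print(bad)
print(len(bad), "of", total, "yes/no pairs are out of order")
```

A perfect rank-order prediction would return an empty list. Here the out-of-order pairs all involve the no-play days 6 and 1.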

 

 

 

 

IV. GenIQ Model Output

The Play-GenIQ Model tree display and its form (computer program) are below.

 

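/* Branch 1: Outlook (overcast) + Outlook (rainy) * Windy (no) */
If outlook = "overcast" Then x1 = 1; Else x1 = 0;
         If windy = "no" Then x2 = 1; Else x2 = 0;
              If outlook = "rainy" Then x3 = 1; Else x3 = 0;
         x2 = x2 * x3;
    x1 = x1 + x2;
/* Branch 2: (Temperature / Humidity) / (Humidity + Outlook (rainy)) */
         If outlook = "rainy" Then x2 = 1; Else x2 = 0;
              x3 = humidity;
         x2 = x2 + x3;
              x3 = humidity;
                   x4 = temperature;
              /* Protected division: the result is 1 when the divisor is 0 */
              If x3 NE 0 Then x3 = x4 / x3; Else x3 = 1;
         If x2 NE 0 Then x2 = x3 / x2; Else x2 = 1;
    x1 = x1 + x2;
/* The model score is the sum of the two branches */
GenIQvar = x1;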
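
For readers without SAS, the program above can be translated line by line into Python. This is a sketch for checking the arithmetic only, not the GenIQ software itself; the names `geniq_var` and `LETS_PLAY` are assumptions introduced here, and the data rows are copied from Table 1.

```python
# Table 1 rows: (day, outlook, temperature, humidity, windy, play)
LETS_PLAY = [
    (1, "sunny", 85, 85, "no", "no"),
    (2, "sunny", 80, 90, "yes", "no"),
    (3, "overcast", 83, 86, "no", "yes"),
    (4, "rainy", 70, 96, "no", "yes"),
    (5, "rainy", 68, 80, "no", "yes"),
    (6, "rainy", 65, 70, "yes", "no"),
    (7, "overcast", 64, 65, "yes", "yes"),
    (8, "sunny", 72, 95, "no", "no"),
    (9, "sunny", 69, 70, "no", "yes"),
    (10, "rainy", 75, 80, "no", "yes"),
    (11, "sunny", 75, 70, "yes", "yes"),
    (12, "overcast", 72, 90, "yes", "yes"),
    (13, "overcast", 81, 75, "no", "yes"),
    (14, "rainy", 71, 91, "yes", "no"),
]


def geniq_var(outlook, temperature, humidity, windy):
    """Line-by-line translation of the Play-GenIQ program."""
    x1 = 1 if outlook == "overcast" else 0
    x2 = 1 if windy == "no" else 0
    x3 = 1 if outlook == "rainy" else 0
    x2 = x2 * x3
    x1 = x1 + x2
    x2 = 1 if outlook == "rainy" else 0
    x3 = humidity
    x2 = x2 + x3
    x3 = humidity
    x4 = temperature
    x3 = x4 / x3 if x3 != 0 else 1   # protected division
    x2 = x3 / x2 if x2 != 0 else 1   # protected division
    x1 = x1 + x2
    return x1


# Score every day and sort from highest to lowest GenIQvar.
scored = sorted(
    ((geniq_var(o, t, h, w), day, play) for day, o, t, h, w, play in LETS_PLAY),
    reverse=True,
)
ranking = [day for _, day, _ in scored]
lowest_yes = min(score for score, _, play in scored if play == "yes")
highest_no = max(score for score, _, play in scored if play == "no")

print(ranking)
print(lowest_yes > highest_no)  # True when every play day outranks every no-play day
```

Sorting the days by the returned score gives the ordering used for the rank-order prediction, and the final comparison checks whether the play days and no-play days are fully separated.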

 

 

V. GenIQ Variable Selection

GenIQ variable selection provides a rank-ordering of variable importance for a predictor variable with respect to the other predictor variables considered jointly. This is in stark contrast to the well-known, always-used statistical correlation coefficient, which provides only a simple correlation between a predictor variable and the target variable, independent of the other predictor variables under consideration.

 

Variable Importance (with respect to other variables considered jointly)

1.      Outlook (overcast)

2.      Outlook (rainy)

3.      Windy (no)

4.      Humidity

5.      Outlook (sunny)

6.      Windy (yes)

7.      Temperature
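
For contrast, the simple correlation coefficient mentioned above can be computed one predictor at a time against Play. The Python sketch below does this for the Table 1 data; it is not the GenIQ importance measure, whose computation is not described in this article, and the names `pearson`, `predictors` and `correlations` are assumptions introduced for illustration.

```python
import math

# Table 1 rows: (outlook, temperature, humidity, windy, play)
DATA = [
    ("sunny", 85, 85, "no", "no"), ("sunny", 80, 90, "yes", "no"),
    ("overcast", 83, 86, "no", "yes"), ("rainy", 70, 96, "no", "yes"),
    ("rainy", 68, 80, "no", "yes"), ("rainy", 65, 70, "yes", "no"),
    ("overcast", 64, 65, "yes", "yes"), ("sunny", 72, 95, "no", "no"),
    ("sunny", 69, 70, "no", "yes"), ("rainy", 75, 80, "no", "yes"),
    ("sunny", 75, 70, "yes", "yes"), ("overcast", 72, 90, "yes", "yes"),
    ("overcast", 81, 75, "no", "yes"), ("rainy", 71, 91, "yes", "no"),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


play = [1 if row[4] == "yes" else 0 for row in DATA]
predictors = {
    "Outlook (overcast)": [1 if row[0] == "overcast" else 0 for row in DATA],
    "Outlook (rainy)": [1 if row[0] == "rainy" else 0 for row in DATA],
    "Outlook (sunny)": [1 if row[0] == "sunny" else 0 for row in DATA],
    "Windy (yes)": [1 if row[3] == "yes" else 0 for row in DATA],
    "Temperature": [row[1] for row in DATA],
    "Humidity": [row[2] for row in DATA],
}

# Each correlation looks at one predictor in isolation from the others.
correlations = {name: pearson(xs, play) for name, xs in predictors.items()}
for name, r in sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:<20}{r:+.4f}")
```

Because each coefficient ignores the other predictors, an ordering by absolute correlation need not agree with a ranking that considers the predictors jointly.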

 

VI. GenIQ Data Mining

GenIQ data mining is directly apparent from the GenIQ tree itself: Each branch is a newly constructed variable, which has the power to improve the rank-order predictions.

1.      Var1 = Temperature / Humidity

2.      Var2 = Humidity + Outlook (rainy)

3.      Var3 = Var1 / Var2

4.      Var4 = Outlook (rainy) * Windy (no)

5.      Var5 = Var4 + Outlook (overcast)

6.      GenIQ Model = Var3 + Var5
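
Substituting the constructed variables into one another gives a single expression for the model score. The symbols are notation introduced here for illustration: T is Temperature, H is Humidity, and R, O and W_no are 0/1 indicators for Outlook (rainy), Outlook (overcast) and Windy (no).

```latex
\[
\text{GenIQvar}
  = \underbrace{\frac{T / H}{H + R}}_{\text{Var3}}
  + \underbrace{R \cdot W_{\text{no}} + O}_{\text{Var5}}
\]

% Worked example, day 7 (overcast, T = 64, H = 65, windy = yes),
% so R = 0, O = 1, W_no = 0:
\[
\text{GenIQvar}
  = \frac{64 / 65}{65 + 0} + 0 \cdot 0 + 1
  \approx 0.015148 + 1
  = 1.015148
\]
```

The worked value agrees with the GenIQvar reported for day 7 in Table 3. The program's guard against division by zero does not come into play for these data, since Humidity is positive on every day.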

 

 

 

 

VII. Play-GenIQ Model Results

The results of the Play-GenIQ Model are in Table 3. There is a perfect rank-order prediction of Play. 

 

Table 3. GenIQ Model Rank-order Prediction of Play

Day   Outlook    Temperature   Humidity   Windy   Play   GenIQvar
7     overcast   64            65         yes     yes    1.015148
13    overcast   81            75         no      yes    1.014400
10    rainy      75            80         no      yes    1.011574
3     overcast   83            86         no      yes    1.011222
5     rainy      68            80         no      yes    1.010494
12    overcast   72            90         yes     yes    1.008889
4     rainy      70            96         no      yes    1.007517
11    sunny      75            70         yes     yes    0.015306
9     sunny      69            70         no      yes    0.014082
6     rainy      65            70         yes     no     0.013078
1     sunny      85            85         no      no     0.011765
2     sunny      80            90         yes     no     0.009877
14    rainy      71            91         yes     no     0.008481
8     sunny      72            95         no      no     0.007978

 

 

VIII. Summary

The machine learning paradigm (MLP), “let the data suggest the model,” is a practical alternative to the statistical paradigm, “fit the data to the LRM equation,” which has its roots in a time when data were only “small.”  It was, and still is, reasonable to fit small data to a rigid, parametric, assumption-filled model. However, the current information (big data) in, say, cyberspace requires a paradigm shift. MLP is a utile approach for database response modeling when dealing with big data, as big data can be difficult to fit to a specified model. Thus, MLP can function alongside the regnant statistical approach when the data, big or small, simply do not “fit.” As demonstrated with the “Let’s Play” data, MLP also works well within small data settings.

 

 

For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at br@dmstat1.com.

