Linear model vs decision tree (in R)

I’ve used the R statistical language a bit, a long time ago. It was my first real encounter with data science, but future encounters used Matlab and Python. But lately I’ve been picking up R again, as it’s popular in the data science community.

As practise / demo, I thought I’d do a simple exploration of the strengths and weaknesses of linear models versus decision trees. This was inspired by Claudia Perlich at kdnuggets.

Let’s get started!

# the `predict_grid` function is attached
require(lattice)
require(rpart)

We’ll seed the random number generator so we get reproducible results:

N = 1000
set.seed(123456789)

Interactive data

Let’s try some data where the dependent variable depends strongly on the interaction between two inputs.

x = runif(N) * 2 - 1
y = runif(N) * 2 - 1
z = ifelse(x > y, 1, 0)

which looks like this

xyplot(x ~ y, group=z, grid=TRUE, asp=1)

Interactive data - linear fit

First let’s see how a linear fit performs:

linmod = glm(z ~ x + y, family=binomial(link='logit'))
glm.fit: algorithm did not convergeglm.fit: fitted probabilities numerically 0 or 1 occurred
summary(linmod)

Call:
glm(formula = z ~ x + y, family = binomial(link = "logit"))

Deviance Residuals: 
      Min         1Q     Median         3Q        Max  
-0.003177   0.000000   0.000000   0.000000   0.003253  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     16.0      851.5   0.019    0.985
x             3660.7    55664.0   0.066    0.948
y            -3695.0    56387.9  -0.066    0.948

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3861e+03  on 999  degrees of freedom
Residual deviance: 2.1237e-05  on 997  degrees of freedom
AIC: 6

Number of Fisher Scoring iterations: 25

Note that we get a warning because the results are too perfect: the linear model can separate these classes easily.

So what do the predictions look like graphically?

data = predict_grid(linmod, size=25);

which is perfect. A linear model can predict interactive data as long as it’s linear.

Interactive data - decision trees fit

Let’s see if decision trees can do the same.

dtmod = rpart(z ~ x + y,    method="class")
# summary(dtmod)  # more verbose results
dtmod
n= 1000 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 1000 493 0 (0.50700000 0.49300000)  
   2) x< 0.09964573 548 138 0 (0.74817518 0.25182482)  
     4) y>=-0.2762061 365  14 0 (0.96164384 0.03835616)  
       8) y>=-0.1460217 326   4 0 (0.98773006 0.01226994) *
       9) y< -0.1460217 39  10 0 (0.74358974 0.25641026)  
        18) x< -0.1961216 29   0 0 (1.00000000 0.00000000) *
        19) x>=-0.1961216 10   0 1 (0.00000000 1.00000000) *
     5) y< -0.2762061 183  59 1 (0.32240437 0.67759563)  
      10) x< -0.598692 65  15 0 (0.76923077 0.23076923)  
        20) y>=-0.7330697 49   2 0 (0.95918367 0.04081633) *
        21) y< -0.7330697 16   3 1 (0.18750000 0.81250000) *
      11) x>=-0.598692 118   9 1 (0.07627119 0.92372881) *
   3) x>=0.09964573 452  97 1 (0.21460177 0.78539823)  
     6) y>=0.399141 135  46 0 (0.65925926 0.34074074)  
      12) x< 0.6707242 80   5 0 (0.93750000 0.06250000) *
      13) x>=0.6707242 55  14 1 (0.25454545 0.74545455)  
        26) y>=0.8781129 10   0 0 (1.00000000 0.00000000) *
        27) y< 0.8781129 45   4 1 (0.08888889 0.91111111) *
     7) y< 0.399141 317   8 1 (0.02523659 0.97476341) *

There’s a lot of text, but what does it look like?

predict_grid(dtmod, size=25);
[1] 0

So the decision tree cannot really explain the interaction. The boundary is parallel to the x or y axis at any time, it does not take interaction into account. At most it can approximate this boundary with a really deep tree.

Non-linear data

First to make the data and show it:

x = runif(N) * 2 - 1
y = runif(N) * 2 - 1
z = ifelse(abs(x) < sqrt(2) / 2 & abs(y) < sqrt(2) / 2, 1, 0)
xyplot(x ~ y, group=z, grid=TRUE, asp=1)

Non-linear data - linear fit

We make a linear fit first. We know how to do that now, so let’s skip to the results:

[1] 0

As the terms “non-linear data” and “linear fit” suggest, they don’t mix very well. There’s just a slight gradient due to randomness, but nothing approximating a solution.

This is the case because the data isn’t separable by a straight line, not even approximately. Linear models have straight decision boundaries, so it can’t find a solution.

Non-linear data - decision trees fit

Let’s see how a decision tree does.

which is exactly right. Decision trees are good when there are no interactions, and boundaries are parallel to the coordinate axes.

XOR data

Let’s try another one: learn the binary XOR operation. Let’s do this quickly:

x = as.integer(rnorm(N) > 0)
y = as.integer(rnorm(N) > 0)
x[1:10]
 [1] 0 0 0 1 1 0 1 1 0 1
y[1:10]
 [1] 1 1 0 0 1 1 0 1 1 0
z = 1 - (x == y)
z[1:10]
 [1] 1 1 0 1 0 1 1 0 1 1

which looks like:

The linear and decision tree results look like:

So although this problem has interactions, due to the limited possiblities, they can also be represented as simple horizontal and vertical boundaries, and the decision tree performs well.

Since there are so few possiblities, the notion of linearity is not very interesting (xor is not linear in reals but is in some rings). Since the number of cutoffs needed is very limited, the decision tree performs well.

However, the linear model cannot solve this problem. This could have been expected, since the decision boundary is a straight line, and there is no straight line that solves this problem (in the original coordinates, at least).

Interactive, non-linear data

Say we want to feel superior to computers. Let’s make some data that’s simple for us but hard for the computer, by making it non-linear and interactive.

x = runif(N) * 2 - 1
y = runif(N) * 2 - 1
z = ifelse(abs(x - y) > 2 - sqrt(2), 1, 0)
xyplot(x ~ y, group=z, grid=TRUE, asp=1)

Haha, look at the sillycomputers messing it up!

Extra parameter - linear fit

But let’s say we help the models just a little bit. We’ll add a simple third parameter:

w = abs(x - y)

Here’s the result for the linear fit:

glm.fit: algorithm did not convergeglm.fit: fitted probabilities numerically 0 or 1 occurred

Suddenly it got the perfect result!

This happened becausein the new w variable, there is no interaction. The region in the left top and the region in the right bottom now overlap, making the problem linearly separable.

Extra parameter - decision tree

For the decision tree, we will use the same parameter w. Here is the result:

The decision tree, too, got it perfect! But the reason is different. The new parameter helped the decision tree because the w axis is perpendicular to the decision boundary. In w space, the cutoff is a simple point, not a curve as in x,y space.

Note that w = y - x, without the absolute value, would have also worked for the decision tree. It’d just be two cutoff points, but the boundary remains perpendicular. But it wouldn’t have helped the linear model at all, since there’d not be any linear decision boundary that solves the problem.

Here is a plot of x - y to illustrate the result:

v = x - y
w = abs(x - y)
xyplot(z ~ v)

A simple one-dimensional problem!

Previous models

Can you think of some new coordinates that solve the earlier difficulties the decision tree had for the “diagonal data”, and the linear model for the “square box data”?

Did you do it?

Here is one solution:

  • Diagonal data: adding a parameter w = x - y makes the separation line perpendicular to w. The problem is then very easy to solve for the decision tree (it was already easy for the linear model).
  • Square box data: adding a parameter w which is the higher one of abs(x) and abs(y), makes the problem easily solvable with a decision boundary at w = sqrt(2) / 2.

Today’s lessons

  • Linear models are bad at interactions (that can’t be separated by a straight line).
  • Decision trees are bad at ‘curved’ decision boundaries (that aren’t parallel to a coordinate).
  • Both can be improved a lot by smart choice of extra features. The choices should be focussed on overcomming their weakness, and are generally different for each model.

Impressions about R

I’m a novice at R, so don’t pay too much attention to this, I’ll probably change my mind.

  • It’s harder to find things on Google for me. I regularly get results for other languages when searching for R stuff. Maybe Google needs some time to figure our I’m doing R now.
  • Rstudio is quite nice. Similar to Matlab. It’s intuitive, although problem highlighting could be better/faster. It’s no match for PyCharm, but I really love PyCharm, so that’s not a very meaningful comparison.
  • I like R notebooks more than Jupyter ones, it feels like you’re working in one file instead of a hundred tiny ones pasted together, and it’s more smooth in general.
  • I’m finding the style of R less predictable than Python’s. Maybe I’m just not used to it yet. But it’s a bit confusing when things are ‘values’ and when they are ‘symbols’ whose name should match next time. And what dots mean in which case. Or why things about objects aren’t just properties of the object (especially dim(arr) <- value seems really weird). Or why array slices aren’t as simple as in Python (or even Matlab) - why would a function like head need to exist? Altogether it feels like a collection of tools, rather than a language (not necessarily bad for it’s purpose).
  • Functions with ~ seem cool, I’ll have to study that more.
---
title: "Linear models vs decision trees"
output:
  html_document:
    fig_height: 4
    fig_width: 4
  html_notebook:
    fig_height: 4
    fig_width: 4
---

I've used the R statistical language a bit, a long time ago. It was my first real encounter with data science, but future encounters used Matlab and Python. But lately I've been picking up R again, as it's popular in the data science community.

As practise / demo, I thought I'd do a simple exploration of the strengths and weaknesses of linear models versus decision trees. This was inspired by [Claudia Perlich at kdnuggets](http://www.kdnuggets.com/2016/12/interviews-data-scientists-claudia-perlich.html).

Let's get started!

```{r, echo=FALSE, quiet=TRUE}
rm(list=ls())
source('vis_pred.R')
```
```{r, quiet=TRUE}
# the `predict_grid` function is attached
require(lattice)
require(rpart)
```

We'll seed the random number generator so we get reproducible results:

```{r}
N = 1000
set.seed(123456789)
```

Great, let's get started!

Interactive data
===============================

Let's try some data where the dependent variable depends strongly on the interaction between two inputs.

```{r}
x = runif(N) * 2 - 1
y = runif(N) * 2 - 1
z = ifelse(x > y, 1, 0)
```

which looks like this

```{r}
xyplot(x ~ y, group=z, grid=TRUE, asp=1)
```

Interactive data - linear fit
-------------------------------

First let's see how a linear fit performs:

```{r}
linmod = glm(z ~ x + y, family=binomial(link='logit'))
summary(linmod)
```

Note that we get a warning because the results are *too* perfect: the linear model can separate these classes easily.

So what do the predictions look like graphically?

```{r}
data = predict_grid(linmod, size=25);
```

which is perfect. A linear model can predict interactive data as long as it's linear.

Interactive data - decision trees fit
-------------------------------

Let's see if decision trees can do the same.

```{r}
dtmod = rpart(z ~ x + y, 	method="class")
# summary(dtmod)  # more verbose results
dtmod
```

There's a lot of text, but what does it look like?

```{r}
predict_grid(dtmod, size=25);
```

So the decision tree cannot really explain the interaction. The boundary is parallel to the x or y axis at any time, it does not take interaction into account. At most it can approximate this boundary with a really deep tree.

```{r, echo=FALSE, quiet=TRUE}
w = x - y
#dtmod = rpart(z ~ w + x + y, 	method="class")
#predict_grid(dtmod, size=50, true_cls_func=function (x, y) { return(ifelse(x > y, 1, 0)) }, extra_param_func=function(x, y) { return(x - y) });
```

Non-linear data
===============================

First to make the data and show it:

```{r}
x = runif(N) * 2 - 1
y = runif(N) * 2 - 1
z = ifelse(abs(x) < sqrt(2) / 2 & abs(y) < sqrt(2) / 2, 1, 0)
xyplot(x ~ y, group=z, grid=TRUE, asp=1)
```

Non-linear data - linear fit
-------------------------------

We make a linear fit first. We know how to do that now, so let's skip to the results:

```{r, echo=FALSE, quiet=TRUE}
linmod = glm(z ~ x + y, family=binomial(link='logit'))
predict_grid(linmod, size=25);
```

As the terms "non-linear data" and "linear fit" suggest, they don't mix very well. There's just a slight gradient due to randomness, but nothing approximating a solution.

This is the case because the data isn't separable by a straight line, not even approximately. Linear models have straight decision boundaries, so it can't find a solution.

```{r, echo=FALSE, quiet=TRUE}
get_w = function(x, y)
{
  N = length(x)
  v = matrix(0, length(x))
  for (i in 1:length(x)) {
    v[i] = ifelse(abs(x[i]) > abs(y[i]), abs(x[i]), abs(y[i]))
  }
  return(v)
}
w = get_w(x, y)
xyplot(z ~ w)
#linmod = glm(z ~ w + x + y, family=binomial(link='logit'))
#predict_grid(linmod, size=50, true_cls_func=function (x, y) { return(ifelse(abs(x) < sqrt(2) / 2 & abs(y) < sqrt(2) / 2, 1, 0)) }, extra_param_func=get_w);
```

Non-linear data - decision trees fit
-------------------------------

Let's see how a decision tree does.

```{r, echo=FALSE}
dtmod = rpart(z ~ x + y, 	method="class")
pred = predict_grid(dtmod, size=25, true_cls_func=function(x, y) { ifelse(abs(x) < sqrt(2) / 2 & abs(y) < sqrt(2) / 2, 1, 0) });
```

which is exactly right. Decision trees are good when there are no interactions, and boundaries are parallel to the coordinate axes.

XOR data
===============================

Let's try another one: learn the binary XOR operation. Let's do this quickly:

```{r}
x = as.integer(rnorm(N) > 0)
y = as.integer(rnorm(N) > 0)
x[1:10]
y[1:10]
z = 1 - (x == y)
z[1:10]
```

which looks like:

```{r, echo=FALSE}
# xyplot(x ~ y, col=ifelse(z, 'red', 'blue'), grid=TRUE)  # explicitly set the color
xyplot(x ~ y, group=z, grid=TRUE, asp=1)
# use this to save:
# dev.copy(png, filename="xor.png");
# dev.off();
```

The linear and decision tree results look like:

```{r, echo=FALSE}
linmod = glm(z ~ x + y, family=binomial(link='logit'))
data = predict_grid(linmod, dims=c(0, 1, 0, 1), size=25);
```

```{r, echo=FALSE}
dtmod = rpart(z ~ x + y, 	method="class")
data = predict_grid(dtmod, dims=c(0, 1, 0, 1), size=25);
```

So although this problem has interactions, due to the limited possiblities, they can also be represented as simple horizontal and vertical boundaries, and the decision tree performs well. 

Since there are so few possiblities, the notion of linearity is not very interesting (xor is not linear in reals but is in some rings). Since the number of cutoffs needed is very limited, the decision tree performs well.

However, the linear model cannot solve this problem. This could have been expected, since the decision boundary is a straight line, and there is no straight line that solves this problem (in the original coordinates, at least).

Interactive, non-linear data
===============================

Say we want to feel superior to computers. Let's make some data that's simple for us but hard for the computer, by making it non-linear and interactive.

```{r}
x = runif(N) * 2 - 1
y = runif(N) * 2 - 1
z = ifelse(abs(x - y) > 2 - sqrt(2), 1, 0)
xyplot(x ~ y, group=z, grid=TRUE, asp=1)
```

```{r, echo=FALSE}
linmod = glm(z ~ x + y, family=binomial(link='logit'))
data = predict_grid(linmod, size=25);
```

```{r, echo=FALSE}
dtmod = rpart(z ~ x + y, 	method="class")
data = predict_grid(dtmod, size=25);
```

Haha, look at the sillycomputers messing it up!

Extra parameter - linear fit
-------------------------------

But let's say we help the models just a little bit. We'll add a simple third parameter:

```{r}
w = abs(x - y)
```

Here's the result for the linear fit:

```{r, echo=FALSE}
linmod = glm(z ~ w + x + y, family=binomial(link='logit'))
data = predict_grid(linmod, size=25, true_cls_func=function (x, y) { return(ifelse(abs(x - y) > 2 - sqrt(2), 1, 0)) }, extra_param_func=function(x, y) { return(abs(x - y)) });
```

Suddenly it got the perfect result!

This happened becausein the new ``w`` variable, there is no interaction. The region in the left top and the region in the right bottom now overlap, making the problem linearly separable.

Extra parameter - decision tree
-------------------------------

For the decision tree, we will use the same parameter ``w``. Here is the result:

```{r, echo=FALSE}
dtmod = rpart(z ~ w + x + y, 	method="class")
data = predict_grid(dtmod, size=50, true_cls_func=function (x, y) { return(ifelse(abs(x - y) > 2 - sqrt(2), 1, 0)) }, extra_param_func=function(x, y) { return(abs(x - y)) })
```

The decision tree, too, got it perfect! But the reason is different. The new parameter helped the decision tree because the ``w`` axis is perpendicular to the decision boundary. In ``w`` space, the cutoff is a simple point, not a curve as in ``x,y`` space.

Note that ``w = y - x``, without the absolute value, would have also worked for the decision tree. It'd just be two cutoff points, but the boundary remains perpendicular. But it wouldn't have helped the linear model at all, since there'd not be any linear decision boundary that solves the problem.

Here is a plot of ``x - y`` to illustrate the result:

```{r}
v = x - y
w = abs(x - y)
xyplot(z ~ v)
```

A simple one-dimensional problem!

Previous models
-------------------------------

Can you think of some new coordinates that solve the earlier difficulties the decision tree had for the "diagonal data", and the linear model for the "square box data"?

Did you do it?

Here is one solution:

* Diagonal data: adding a parameter ``w = x - y`` makes the separation line perpendicular to ``w``. The problem is then very easy to solve for the decision tree (it was already easy for the linear model).
* Square box data: adding a parameter ``w`` which is the higher one of ``abs(x)`` and ``abs(y)``, makes the problem easily solvable with a decision boundary at ``w = sqrt(2) / 2``.

Today's lessons
===============================

* Linear models are bad at interactions (that can't be separated by a straight line).
* Decision trees are bad at 'curved' decision boundaries (that aren't parallel to a coordinate).
* Both can be improved a lot by smart choice of extra features. The choices should be focussed on overcomming their weakness, and are generally different for each model.

Impressions about R
===============================

I'm a novice at R, so don't pay too much attention to this, I'll probably change my mind.

* It's harder to find things on Google for me. I regularly get results for other languages when searching for R stuff. Maybe Google needs some time to figure our I'm doing R now.
* Rstudio is quite nice. Similar to Matlab. It's intuitive, although problem highlighting could be better/faster. It's no match for [PyCharm](https://markv.nl/blog/pycharm), but I really love PyCharm, so that's not a very meaningful comparison.
* I like R notebooks more than Jupyter ones, it feels like you're working in one file instead of a hundred tiny ones pasted together, and it's more smooth in general.
* I'm finding the style of R less predictable than Python's. Maybe I'm just not used to it yet. But it's a bit confusing when things are 'values' and when they are 'symbols' whose name should match next time. And what dots mean in which case. Or why things about objects aren't just properties of the object (especially ``dim(arr) <- value`` seems really weird). Or why array slices aren't as simple as in Python (or even Matlab) - why would a function like `head` need to exist? Altogether it feels like a collection of tools, rather than a language (not necessarily bad for it's purpose).
* Functions with ``~`` seem cool, I'll have to study that more.




Comments

No comments yet

You need to be logged in to comment.