How to fit logistic regression to circular data?

Question

I've made a script that can do normal logistic regression with sigmoid(linear model). However, I have data that has a circular decision boundary and looks like this.

My question is how I can modify my script to handle these data. I am thinking of including the equation of a circle, but I am not sure how.

Convert to polar co-ordinates (https://en.wikipedia.org/wiki/Polar_coordinate_system) — Dikran Marsupial, Sep 13 '21 at 12:10
It looks to me as if $x_0^2 + x_1^2 \leq 1$ implies $y=1$ and otherwise $y=0$. Could you use this as a feature/decision criterion? I.e. distance of $(x_0, x_1)$ from the origin. — jcken, Sep 13 '21 at 12:10
@DikranMarsupial I think a better option (making fewer assumptions) is to include both linear and quadratic effects of $x_1$ and $x_2$ as well as their interaction (unless it is known a priori that the region is perfectly circular rather than elliptic and centered at the origin). But it all depends on what fits the data and what is known a priori about the data generating process. — Jarle Tufto, Sep 13 '21 at 14:50
@JarleTufto when I am teaching ML, I often use a similar example to motivate radial basis function neural networks and kernel learning methods, so I would probably go that route (with regularisation) to make a general non-linear version. However the nice thing about 2D data is we can just look at the data and use intuition to find a sensible transformation. — Dikran Marsupial, Sep 13 '21 at 16:24
A worked example for this kind of problem (arising from a different question) is presented in https://stats.stackexchange.com/questions/164048/can-a-random-forest-be-used-for-feature-selection-in-multiple-linear-regression/164068#164068 — Sycorax, Mar 07 '22 at 17:45
@DikranMarsupial Do you want to turn your comment into an answer? If not, I may try. — Peter Flom, Dec 16 '23 at 11:59

score 0 · Answer 1 · answered Dec 16 '23 at 18:44

A few ideas.

When I first read this question, I thought "just measure distance from the center." Then I read the comments. Dikran Marsupial suggested converting to polar coordinates, which is quite similar to what I had thought of.

In another comment, Sycorax referenced a thread where he suggests using random forests for a similar problem.

Sycorax's example is very well worked out, so I won't say more about that here. But I will modify a bit of his code to create data.

Let's try out my idea and Dikran's. First, I'll make some data that looks more or less like what the question has (I increased the sample size and added some noise):

set.seed(1234)
N  <- 1000
x1 <- rnorm(N, sd=1.5)
x2 <- rnorm(N, sd=1.5)
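# y is 1 inside the unit circle and 0 outside, plus a little Gaussian noise
y  <- apply(cbind(x1, x2), 1, function(x) (x%*%x)<1)  + rnorm(N, 0, .1)
# red: noisy y below 0.8 (mostly points outside the circle); blue: the rest
plot(x1, x2, col=ifelse(abs(y) < 0.8, "red", "blue"))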

This produces

which seems close. Now, just to check, let's try regression on this:

m1 <- lm(y~x1+x2)
summary(m1)

and, as expected, the parameter estimates are very close to 0 and not close to significant. So, let's find the polar coordinates:

r <- sqrt(x1^2 + x2^2)
theta <- atan2(x2, x1)

and regress on r and $\theta$ (I don't expect $\theta$ to be useful). Lo and behold, it works:

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59991 -0.26826 -0.07653  0.27481  0.99404

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.6992665  0.0222995  31.358   <2e-16 ***
r           -0.2606233  0.0106316 -24.514   <2e-16 ***
theta       -0.0003919  0.0057505  -0.068    0.946
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3344 on 997 degrees of freedom
Multiple R-squared:  0.3762,    Adjusted R-squared:  0.3749 
F-statistic: 300.6 on 2 and 997 DF,  p-value: < 2.2e-16
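The call behind the summary above is not shown; it was presumably something like `summary(lm(y ~ r + theta))`. Since the question is about logistic regression, here is a minimal sketch of the same idea with `glm`. The binary response `y_bin`, the 5% label flipping, and the name `m_logit` are my assumptions for illustration: a logistic model needs a 0/1 response, and without some label noise the classes are perfectly separated by `r`, so the maximum likelihood estimates would diverge.

```r
set.seed(1234)
N  <- 1000
x1 <- rnorm(N, sd = 1.5)
x2 <- rnorm(N, sd = 1.5)

# Polar coordinates, as in the answer
r     <- sqrt(x1^2 + x2^2)
theta <- atan2(x2, x1)

# Assumption: 0/1 labels (1 inside the unit circle) with 5% flipped at random,
# so that the two classes are not perfectly separable
inside <- r < 1
flip   <- runif(N) < 0.05
y_bin  <- as.integer(xor(inside, flip))

# Logistic regression on the polar coordinates
m_logit <- glm(y_bin ~ r + theta, family = binomial)
summary(m_logit)
```

As with the linear fit, I would expect a clearly negative coefficient for `r` and nothing of interest for `theta`.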

Of course, you could try different distance measures or modify things in other ways.
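One such modification, following Jarle Tufto's comment on the question, is to skip the polar transformation and give the logistic model linear, quadratic, and interaction terms in `x1` and `x2`. The boundary can then be any conic, and its center falls out of the fitted coefficients instead of being assumed. This is only a sketch on simulated data: the true center `(1, -0.5)`, the steepness `k`, the uniform design, and all object names are assumptions of mine, not anything taken from the question.

```r
set.seed(1)
N  <- 1000
x1 <- runif(N, -2, 2)
x2 <- runif(N, -2, 2)

# Assumed data-generating process: the log-odds fall off with squared distance
# from a circle of radius 1 centered at (1, -0.5)
k  <- 1.5
d2 <- (x1 - 1)^2 + (x2 + 0.5)^2
y  <- rbinom(N, 1, plogis(k * (1 - d2)))
dat <- data.frame(x1, x2, y)

# Full quadratic logistic model: linear, squared, and interaction terms
fit <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2,
           family = binomial, data = dat)

# The fitted log-odds are a quadratic surface; its stationary point solves
# H %*% center = -(b_x1, b_x2), where H is the matrix of second derivatives
b <- coef(fit)
H <- matrix(c(2 * b[["I(x1^2)"]], b[["x1:x2"]],
              b[["x1:x2"]],       2 * b[["I(x2^2)"]]), nrow = 2)
center_hat <- solve(H, -c(b[["x1"]], b[["x2"]]))
center_hat
```

Because the center is a function of the estimated coefficients here, it is estimated jointly with everything else and not fixed in advance. Roughly equal coefficients on the two squared terms and an interaction near zero would suggest the boundary is close to circular.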

Although the plot in the question is suggestive, the question itself does not state that the origin is known. Shouldn't that be a parameter to estimate as well? — whuber, Dec 16 '23 at 19:04
Well, I guess I took the "suggestion" in the plot seriously. You could center the data. That would be more general. — Peter Flom, Dec 16 '23 at 19:15
It would be incorrect, though, because it would use your "center" -- however you compute it -- as an estimate. Moreover, you wouldn't be able to determine an appropriate standard error for that estimate. — whuber, Dec 16 '23 at 22:32
