
I am currently developing software in C. I have an array (matrix) that contains coordinate points like this (125 points for now):

$$\begin{bmatrix}x_1 & x_2 & x_3 & x_4 & x_5 & \dots & x_{125}\\ y_1 & y_2 & y_3 & y_4 & y_5 & \dots & y_{125}\end{bmatrix}$$

What I would like to do is calculate how linear each group of 5 points is. Due to the source of the data, the points will never be perfectly linear in my case. I am thinking there will be an error value that shows how far these 5 points are from perfect linearity; if this error value is smaller than a specified value, the result will show the points as linear, and if not, it will show them as non-linear.

I drew a simple model. The algorithm will return 'linear' for the left one and 'not linear' for the right one.

[Sketch: two groups of five points; the left group lies nearly on a straight line, the right group does not]

I looked at the linear regression method, but I am not sure if this is the right way. Can you help me if there is a better and, more importantly, a more efficient way to do this?

nomad
  • What do you mean by "linearity" here? That the points lie exactly on a straight line? What kind of discrepancies are you expecting? Is it that there is a linear relation with a random noise that makes them lie not exactly on the line, or maybe there is some systematic error, e.g. the higher the $x$ the more the curve bends? Those could be slightly different problems. – Tim Feb 08 '23 at 10:23
  • The points will never be on a straight line, but if they are really close to the line, that's enough for me. There is no random noise in the data; the points are exact. – nomad Feb 08 '23 at 11:01

1 Answer


Regression, specifically simple linear regression, is exactly how I would go about this.

Your regression line is given by a linear equation $$ y = \alpha+\beta x. $$ The coefficients $\alpha$ and $\beta$ are typically set to values that will minimize the sum of squared vertical (!) distances between the data points and the corresponding point on the line. A little mathematics (see Wikipedia) gives us explicit formulas: $$ \beta = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2} $$ and $$ \alpha = \bar{y}-\beta\bar{x},$$ where $\bar{x}$ and $\bar{y}$ are the averages of the $x_i$ and the $y_i$, respectively.
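In case it helps with the C implementation, here is a minimal sketch of these formulas, assuming each group's coordinates are passed as parallel arrays `x[]` and `y[]`; the function name `fit_line` and its error-return convention are just illustrative choices, not part of the answer's recipe:

```c
#include <stddef.h>

/* Fit y = alpha + beta*x by ordinary least squares over n points.
   Hypothetical helper; parameter names are illustrative.
   Returns 0 on success, -1 if all x values are identical
   (vertical line, slope undefined). */
static int fit_line(const double *x, const double *y, size_t n,
                    double *alpha, double *beta)
{
    double x_mean = 0.0, y_mean = 0.0;
    for (size_t i = 0; i < n; ++i) {
        x_mean += x[i];
        y_mean += y[i];
    }
    x_mean /= (double)n;
    y_mean /= (double)n;

    double sxy = 0.0, sxx = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sxy += (x[i] - x_mean) * (y[i] - y_mean);
        sxx += (x[i] - x_mean) * (x[i] - x_mean);
    }
    if (sxx == 0.0)
        return -1;          /* all x equal: cannot regress y on x */

    *beta  = sxy / sxx;
    *alpha = y_mean - *beta * x_mean;
    return 0;
}
```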

Given this regression line, I would calculate the mean squared error, i.e., the mean of the squared vertical distances between the points and the line per above:

$$ \text{MSE} = \frac{1}{n}\sum_{i=1}^n\big(y_i-(\alpha+\beta x_i)\big)^2. $$

Then you can use a cutoff for the MSE to classify your points as linear or not. This cutoff will need to depend on your $n$, on the spread of your data, and on your substantive question.
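A hedged sketch of this classification step, building on the hypothetical `fit_line` above; the threshold is something you would tune on realistic samples, not a recommended value:

```c
#include <stdbool.h>
#include <stddef.h>

/* Mean squared vertical distance between the points and the fitted line. */
static double mse_of_fit(const double *x, const double *y, size_t n,
                         double alpha, double beta)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = y[i] - (alpha + beta * x[i]);
        sum += r * r;
    }
    return sum / (double)n;
}

/* Classify one group of points as "linear enough": true if the MSE of the
   least-squares fit is below the caller-chosen threshold. */
static bool is_linear(const double *x, const double *y, size_t n,
                      double threshold)
{
    double alpha, beta;
    if (fit_line(x, y, n, &alpha, &beta) != 0)
        return true;   /* vertical group of points; handle as you see fit */
    return mse_of_fit(x, y, n, alpha, beta) < threshold;
}
```

You would then call `is_linear` once per group, e.g. with pointers into your 125-point arrays advancing in steps of 5.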


Now, you should check for yourself whether this does what you want. Take a number of "realistic" samples, plot them and the regression line, and see whether you can find a useful cutoff for the MSE that classifies your samples the way you want.

One thing to look out for is "steep" clouds of points with a large slope $\beta$: since the MSE looks at vertical distances, it will be large for such point clouds, which will then be more easily classified as "not linear", even though they "look" linear to us.

If this is a problem, you could look at alternatives like least absolute deviations, where the regression line is calculated to minimize not the sum of squared vertical distances but the sum of absolute vertical distances, or orthogonal regression, which minimizes the sum of squared orthogonal distances from the regression line rather than vertical ones. However, these methods are quite a bit more complex than simple linear regression, where we have the closed-form formulas above. (Another alternative: calculate the regression line as above, but then classify your point clouds by the absolute vertical or squared/absolute orthogonal distance from that "vanilla" regression line.)
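That said, the orthogonal variant does have a closed form in the 2-D case: the mean squared orthogonal distance to the total-least-squares line equals the smaller eigenvalue of the $2\times 2$ covariance matrix of the points (this is essentially the PCA view). A sketch under that assumption, with an illustrative function name:

```c
#include <math.h>
#include <stddef.h>

/* Mean squared ORTHOGONAL distance from the points to the best-fitting
   (total least squares) line: the smaller eigenvalue of the 2x2
   covariance matrix of the points, normalized by n.
   Illustrative helper, not the primary recipe of this answer. */
static double orth_mse(const double *x, const double *y, size_t n)
{
    double x_mean = 0.0, y_mean = 0.0;
    for (size_t i = 0; i < n; ++i) {
        x_mean += x[i];
        y_mean += y[i];
    }
    x_mean /= (double)n;
    y_mean /= (double)n;

    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double dx = x[i] - x_mean, dy = y[i] - y_mean;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    sxx /= (double)n;
    syy /= (double)n;
    sxy /= (double)n;

    double half_trace = 0.5 * (sxx + syy);
    double disc = sqrt(0.25 * (sxx - syy) * (sxx - syy) + sxy * sxy);
    return half_trace - disc;   /* smaller eigenvalue, >= 0 */
}
```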

Stephan Kolassa
  • Your MSE expression is affected by the scale of the $x_i$ and $y_i$: for example, if you multiply all the $y_i$ by $k$ and rerun the regression, then the MSE will be multiplied by $k^2$. So perhaps the next stage is to adjust for scale and end up with something like $R^2$. – Henry Feb 08 '23 at 10:09
  • @Henry: yes, that is why I wrote that any cutoff for the MSE would need to be informed by the spread of the data. $R^2$ is a possibility, but it gets harder to interpret. – Stephan Kolassa Feb 08 '23 at 10:11
  • My points are sometimes really close to each other, x goes like 4.88, 4.89, 4.86, 4.86 and same for y. Would that be a problem? – nomad Feb 08 '23 at 10:58
  • If that is what most of your data sets look like, you should be able to adapt the MSE cutoff. What will be challenging is if some data sets look like that, while others are on a completely different scale. In such a case, you might want to scale the MSE (or take a root, or try other transformations, or look at $R^2$, as @Henry suggests) - it will really depend on the idiosyncrasies of your data. – Stephan Kolassa Feb 08 '23 at 11:28
  • This answer introduces an asymmetry that is not present in the original question. The question appears to concern deviations in the plane from a collection of points and some possible line. That's a multivariate problem solved by PCA, not by regression, as you allude to in the last paragraph. Moreover, if the input points wander around the plane, it won't be meaningful to compare the MSE of one group of 5 to another group of 5. – whuber Feb 08 '23 at 15:01
  • I tried linear regression, but it classifies my points as non-linear even if they look linear to me. I plotted a sample of my points in an online graph tool: https://www.desmos.com/calculator/c7fmwrhqfk How should I modify the formula so that it gives me these points as linear? – nomad Feb 16 '23 at 14:16
  • In my answer, I recommend finding a useful cutoff for the MSE or something similar, because at some point you will need to quantify how much deviation from perfect linearity is acceptable for your points to still look "linear". If your current threshold classifies these points as nonlinear, you may need to use a larger threshold. Alternatively, you could look at PCA as @whuber suggests, and classify points as linear if the first principal component accounts for some minimum share of the total variance. Here is a recent thread on PCA. – Stephan Kolassa Feb 16 '23 at 14:23