Suppose we're given a data set $\{x_1, \dots, x_n\}$ in $\mathbb{R}^D$, the $D$-dimensional Euclidean space, and assume this data has intrinsic dimension $d < D$. N.B. this just means that the data lies on a $d$-dimensional connected manifold, which can be very flat or very curved; it certainly doesn't assume the manifold is a hyperplane, i.e. linear. Assume also that we have no idea about the curvature of this manifold: we just know that it is $d$-dimensional.
Suppose we're interested in determining whether the above manifold, or equivalently the data, has a linear structure or not, i.e. whether the manifold is a $d$-dimensional hyperplane. This means we want to test: does this manifold have zero curvature, and is it homeomorphic to an open subset of a Euclidean space? To ease things, assume the manifold is homeomorphic to an open subset of a Euclidean space, and then test whether its sectional curvature is identically zero.
Is it possible to test this, and if yes, what test(s) do we need? More specifically, what I want is a hypothesis test following these steps:
(1) Construct a suitable statistic $\Theta(X_1, \dots, X_n)$ that's representative of the linearity of the data.
(2) Determine the sampling distribution of the statistic, and if needed, the limiting distribution of $\Theta(X_1, \dots, X_n)$ as $n \to \infty$.
(3) Accept the null hypothesis $H_0$: "the data is linear" if, for a chosen threshold $\theta$, $\Theta(X_1, \dots, X_n) \le \theta$, and reject it otherwise.
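To make steps (1)-(3) concrete, here is a minimal sketch of how such a test could be wired together. Everything in it is an assumption made for illustration, not an established test: the statistic `mean_sq_residual` (mean squared distance to the best-fitting $d$-plane), the null model in `simulate_null` (points on a random $d$-plane plus isotropic Gaussian noise of known scale `sigma`), and the Monte Carlo threshold are all hypothetical choices. Some noise model is needed because, for noise-free data on an exact hyperplane, this statistic is identically zero and its sampling distribution is degenerate.

```python
import numpy as np

def mean_sq_residual(X, d):
    """Step (1): mean squared distance from the rows of X to their
    best-fitting d-dimensional affine subspace (computed via SVD)."""
    Xc = X - X.mean(axis=0)                    # centre the data
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    return float(np.sum(s[d:] ** 2) / X.shape[0])

def simulate_null(n, D, d, sigma, rng):
    """ASSUMED null model: n points on a random d-plane in R^D,
    perturbed by isotropic Gaussian noise with standard deviation sigma."""
    basis, _ = np.linalg.qr(rng.standard_normal((D, d)))  # D x d, orthonormal columns
    coords = rng.standard_normal((n, d))
    return coords @ basis.T + sigma * rng.standard_normal((n, D))

def null_threshold(n, D, d, sigma, alpha=0.05, n_sim=500, seed=0):
    """Step (2): approximate the null distribution by simulation and
    return its empirical (1 - alpha) quantile as the threshold theta."""
    rng = np.random.default_rng(seed)
    stats = [mean_sq_residual(simulate_null(n, D, d, sigma, rng), d)
             for _ in range(n_sim)]
    return float(np.quantile(stats, 1.0 - alpha))

def accept_linear(X, d, theta):
    """Step (3): accept H0 when the statistic does not exceed theta."""
    return mean_sq_residual(X, d) <= theta

# Usage (hypothetical sizes): calibrate theta, then apply it to observed data.
theta = null_threshold(n=100, D=10, d=1, sigma=0.01)
```

The threshold depends on the assumed noise level `sigma`; without such an assumption, any nonzero residual would be evidence against exact linearity.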
Let's consider two specific examples of data where I'd like to find out, with a certain confidence, whether the data is linear or not using the test described above. I'd prefer an answer for the second example.
Example I: Consider data coming from the $2$-sphere, embedded by zero padding in $\mathbb{R}^{50}$: the $x_i$ are sampled from $M:=\{(x, y, z, 0, \dots, 0): x^2 + y^2 + z^2 = 1\}$, with $0$ occurring $47$ times in this expression so that $M \subset \mathbb{R}^{50}$. Clearly in this case $M$ is two-dimensional and not a linear subspace of $\mathbb{R}^{50}$, so the test I'm asking for would answer in the negative: it'd tell us that the samples came from a nonlinear manifold, and not a linear hyperplane.
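For anyone who wants to experiment with Example I, a sample from $M$ can be generated along the following lines. The sampling distribution is not specified above, so the uniform distribution on the sphere is an assumption here, and the helper name `sample_padded_sphere` is hypothetical.

```python
import numpy as np

def sample_padded_sphere(n, ambient_dim=50, seed=0):
    """Draw n points from the unit 2-sphere and zero-pad them into
    R^ambient_dim. ASSUMPTION: points are uniform on the sphere,
    obtained by normalising standard Gaussian vectors in R^3."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, 3))
    pts = g / np.linalg.norm(g, axis=1, keepdims=True)  # project onto the sphere
    X = np.zeros((n, ambient_dim))
    X[:, :3] = pts                                      # remaining 47 coordinates stay 0
    return X

X = sample_padded_sphere(100)  # array of shape (100, 50)
```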
Example II: Perhaps Example I above was a bit easy, so consider instead $100$ data points $x_i \in \mathbb{R}^{50}$, $x_i = (y_i, 0, \dots, 0)$, where $0$ occurs $47$ times, and the $y_i$'s come from the manifold $\{(x,y,z): x^2 y^3 z + \sin(xy)\cos(yz) + \tan(y-z+1) - xy^2 e^z - xyz + \cos(yz) - xy + z - 5 = 0\}$. The reason for citing this example is that, unlike Example I, it's not a linear function of functions of one variable, whereas Example I was a linear/affine function of $x^2, y^2, z^2$.
So I see that many of you have suggested PCA, and, perhaps because of my own background, I'm having trouble understanding how exactly it helps us infer whether the manifold $M$ is linear or not. Say, given $d$, I do the PCA and find the $d$-dimensional hyperplane that best approximates the data (or equivalently, maximizes the variance of the projected data). I'm okay with this so far, but what do we do next? What's the statistic in question that'd help me accept or reject the null hypothesis that the data is linear?
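For concreteness, one quantity that PCA makes available after fitting the $d$-dimensional hyperplane is the fraction of total variance left outside it. The sketch below is only a candidate for the statistic in question, not something proposed in the answers; the function name `residual_variance_fraction` and the sample sizes are hypothetical, and the sphere sample assumes a uniform distribution. The fraction is zero (up to rounding) for data lying exactly on a $d$-plane, strictly positive otherwise, and unchanged by rescaling the data.

```python
import numpy as np

def residual_variance_fraction(X, d):
    """Fraction of total variance not captured by the top d principal
    directions. Undefined (division by zero) if all rows of X coincide."""
    Xc = X - X.mean(axis=0)                    # centre the data
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    return float(np.sum(s[d:] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)

# Flat case: 500 points on a random 2-dimensional affine plane in R^50.
basis = rng.standard_normal((2, 50))
plane = rng.standard_normal((500, 2)) @ basis + rng.standard_normal(50)

# Curved case: 500 points on the zero-padded unit 2-sphere (uniform, by assumption).
g = rng.standard_normal((500, 3))
sphere = np.zeros((500, 50))
sphere[:, :3] = g / np.linalg.norm(g, axis=1, keepdims=True)

frac_plane = residual_variance_fraction(plane, d=2)     # essentially zero
frac_sphere = residual_variance_fraction(sphere, d=2)   # clearly positive
frac_scaled = residual_variance_fraction(1e-3 * sphere, d=2)  # same as frac_sphere
```

This still leaves open the sampling distribution of the fraction under the null, which depends on what noise, if any, is allowed around the hyperplane.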


Regarding your second comment: no, I don't want that. Yes, if you scale the data enough, then it'd lie arbitrarily close to zero, so given any hyperplane $H$, the whole dataset is within dist-distance $\epsilon$ of $H$. So I guess this makes characterizing "almost linear" just by using dist a bad way to do so. Hmm... I'll think about it, but for sure scaling shouldn't affect anything. Thanks for pointing that out!
– Mathmath May 20 '20 at 14:37