4

Statistical inference, or inferential statistics, may be the most important topic in Statistics, perhaps even the essence of Statistics itself.

Historically, statistical inference was developed to deal with the uncertainty that arises because a sample is small relative to the population. With a very large sample we do not suffer from such uncertainty: as the sample size grows, statistical significance becomes arbitrarily high and even a tiny effect can be detected with almost 100% certainty. But statistical significance does not necessarily mean that the results are practically significant in a real-world sense of importance. Even if an effect or a difference is found to exist, it may be negligible in your field of study. So what matters then is practical significance, about which Statistics cannot tell you anything.
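
A minimal sketch of what I mean (in Python; the effect size and sample size are illustrative assumptions of mine): with a huge sample, a practically negligible effect becomes "statistically significant" with near certainty.

```python
# Minimal sketch: a practically negligible effect (1% of a standard deviation)
# becomes "statistically significant" once the sample is huge.
# The effect size and sample size below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.01            # tiny, practically negligible difference in means
n = 5_000_000                 # "big data" sample size per group

control = rng.normal(loc=0.0, scale=1.0, size=n)
treatment = rng.normal(loc=true_effect, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"estimated difference: {treatment.mean() - control.mean():.4f}")
print(f"p-value: {p_value:.1e}")  # essentially zero: highly "significant", hardly important
```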

So I personally expect that the utility of various statistical methods may eventually decline, at least in those fields where big data sets become available. I wonder whether you disagree with this idea and, if you do, why.

Royalblue
  • 197
  • I believe this question has two parts: 1) whether big data requires statistics (in its narrow interpretation as the study of the effects of randomness and noise in observational research), and 2) whether big data is going to replace everything in the era of big data. I find this latter aspect a bit opinion-based, and it is unclear what will happen in the future. While there is big data, not everything is big data (e.g. think of studies on rare events), and for statistics to become obsolete you need some way for big data to replace everything (I am thinking now of the Hitchhiker's Guide's 'Deep Thought'). – Sextus Empiricus Dec 04 '20 at 08:15
  • 1
    What is an example of current big data that does not use statistical inference? – Sextus Empiricus Dec 04 '20 at 08:42
  • @sextus-empiricus It's a matter of definition what you consider "statistical inference", but I'd (subjectively!) claim that image (faces, tumors, pedestrians...) or sound (voice etc.) recognition and basically everything using neural networks and, probably, many other "machine learning" techniques (again, a highly subjective classification) do not use "statistical inference". They rely on no generative model for the data and no error model, and do not (and cannot!) deliver explanatory models. – Igor F. Dec 04 '20 at 10:08
  • 1
    @IgorF But are such things as image recognition a replacement for research that previously relied on 'statistical inference'? I consider algorithms like image or speech recognition more as tools, some advanced form of computing, say, an average. – Sextus Empiricus Dec 04 '20 at 10:18
  • @sextus-empiricus That's a different question and, I believe, unjustifiably reduced to research. But Paul Werbos developed backpropagation (THE learning method in multi-layer neural networks, ISBN: 0-471-59897-6) for questions that had earlier been answered using ordinary linear regression or ARMA. A colleague of mine used to work in finance and told me that they used machine learning algorithms and that logistic regression usually had the worst performance. So, yes, I believe that ML algorithms, requiring big data, can replace "statistical inference" in some cases. – Igor F. Dec 04 '20 at 10:55
  • 1
    @Igor That there are examples for which a more complex (machine learning) model is better than a simpler (logistic regression) model is no surprise. But how did they determine that it is better? There is still some classic statistics going on there. I believe that the sole aspect of RoyalBlue's question that relates to 'inference becomes obsolete' is when big data becomes 'a lot of data', such that we can make very accurate estimates of some model parameters and, in a sort of central-limit-theorem way, do not really need to use advanced techniques.... – Sextus Empiricus Dec 04 '20 at 11:36
  • ....but the machine learning methods still rely on statistics. It is just often not in a mathematical closed form using pretty distributions, but instead uses cross validation and Monte Carlo techniques. – Sextus Empiricus Dec 04 '20 at 11:38

3 Answers

6

I hope this question does not get closed as "opinion based" because, even if it is, I think it's quite relevant.

There are, in my opinion, several issues to consider. First, the question of statistical "significance". It is commonly (mis)used as a decision-making tool, at least in the field where I work (medicine), although it is completely unsuitable for that purpose. One might hope that the inflation of "significant" results in big-data settings would lead researchers to revise their views of the meaning of "significance" and to look for better decision-making tools, e.g. cost/benefit calculations.
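
As a toy sketch of what such a tool might look like (every number below is invented purely for illustration, not a real medical figure), a decision could rest on expected net benefit rather than on a p-value:

```python
# Toy sketch of a cost/benefit decision rule; all numbers are made up.
# A treatment effect can be statistically detectable in a huge trial and
# still fail a simple expected-net-benefit calculation.
estimated_risk_reduction = 0.002      # absolute risk reduction, "significant" in a big trial
benefit_per_event_avoided = 50_000.0  # hypothetical value of one avoided adverse event
cost_of_treatment = 200.0             # hypothetical cost per patient treated

expected_net_benefit = (estimated_risk_reduction * benefit_per_event_avoided
                        - cost_of_treatment)
print(f"expected net benefit per patient: {expected_net_benefit:+.2f}")
# Negative here: despite the tiny p-value, treating everyone is not worth it.
```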

On the other hand, big data allows us to test a large number of hypotheses on the same dataset. In "classical" statistics, with "small" data, the necessary corrections of the significance level (Bonferroni etc.) quickly lead to situations where we miss an actually existing effect. In this case, big data could actually support classical statistical methods.
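
A rough simulation of this point (the number of hypotheses, the effect size and the sample sizes are invented for illustration): at a Bonferroni-corrected threshold, a real but modest effect is easily missed with a small sample and almost always detected with a much larger one.

```python
# Simulated power of a one-sample t-test at the Bonferroni-corrected level alpha/m.
# All numbers (effect size, sample sizes, number of hypotheses) are illustrative.
import numpy as np
from scipy import stats

def power_at_bonferroni(effect, n, m, alpha=0.05, n_sim=2000, seed=1):
    """Fraction of simulations in which a true effect is detected at level alpha/m."""
    rng = np.random.default_rng(seed)
    threshold = alpha / m
    hits = 0
    for _ in range(n_sim):
        sample = rng.normal(loc=effect, scale=1.0, size=n)
        hits += stats.ttest_1samp(sample, popmean=0.0).pvalue < threshold
    return hits / n_sim

m = 1_000  # number of hypotheses tested on the same dataset
print(power_at_bonferroni(effect=0.3, n=50, m=m))    # "small" data: the real effect is usually missed
print(power_at_bonferroni(effect=0.3, n=500, m=m))   # larger data: the effect is almost always detected
```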

Another point worth considering is the importance of probabilistic models. All statistical methods are based on some assumptions regarding the (unobservable!) models underlying the observed data. These models are guided at least as much by our ability to do the mathematics as by our domain knowledge about the data. In fact, in my experience, we mostly choose a model based on our familiarity with it, or on the availability of software, and not on deliberations about how the data came into being. Big data might allow us to do model-free (or model-poor) data analysis and get practically more useful, and perhaps even more accurate, results. So, in this regard, I would tend to support your thesis.

Altogether, I think there are pros and cons for your thesis. Big data and machine learning methods will likely supplement, but not completely replace, classical statistics. Computers and their numerical capabilities haven't made mathematical analysis obsolete, and photography didn't make painting disappear.

Igor F.
  • 9,089
4

Big data has proven itself, and it may use methods that rely little on "statistics" (that is, rely little on considerations about errors and noise), but:

  1. It is still far from clear whether big data is going to replace everything.

    Will we have such an abundance of resources (measurements and computation) that we can fill every space with thousands of sensors and gather a universe of data, just to solve a problem that could have been attacked with a minimalistic approach?

  2. It is doubtful how scalable big data is and whether it can deal, in a simple, non-advanced way, with noise and random variation. Is big data really so different from small data?

    If you have bad data that cannot solve a problem because of a too low signal-to-noise ratio, can the problem be solved by just gathering more of it? Maybe it can in some cases, but will that be the most efficient approach?

    The analysis methods of big data might be very powerful, but what about the underlying assumptions of the model and the data-gathering process? We can achieve extremely high precision by taking simple averages (or some neural network that does this in a more fluid way), but if there is some systematic error then the result can still be completely wrong (think of image recognition that can be tricked). These types of errors still need to be evaluated with "classical" statistical methods that handle small data.
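
A minimal sketch of this last point (the bias and noise levels below are made up): the standard error of a simple average shrinks towards zero as the sample grows, but a systematic error in the measurements remains untouched.

```python
# Averaging "big data" makes the random error vanish, but a systematic
# measurement error of +0.5 is untouched: the estimate becomes very
# precise and stays wrong. All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(2)
true_value = 10.0
bias = 0.5                                   # systematic error in the measurement process

for n in (100, 10_000, 10_000_000):
    measurements = true_value + bias + rng.normal(scale=2.0, size=n)
    estimate = measurements.mean()
    std_error = measurements.std(ddof=1) / np.sqrt(n)
    print(f"n={n:>10,}  estimate={estimate:.4f}  std.error={std_error:.5f}  "
          f"error={estimate - true_value:+.4f}")
```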

In addition, the question of 'classical statistics' being replaced by 'big data' is perhaps a loaded question, and it sketches a false dichotomy. It is wrong in the first place to think of the two as unrelated or different.

chl
  • 53,725
2

One thing that has not yet been mentioned, and which I believe is important, is that even with big data, and even with data on the whole population, you still need statistical inference.

The reason is that you simply cannot directly observe the data-generating process (DGP) or know what the true model is.

For example, nowadays in some countries it is possible to get your hands on data on the 'whole population'. You might have access to all anonymized tax records (if your study population is all people with taxable income), to data on all sales made in a country, etc. As a matter of fact, even before the big data revolution you could get your hands on aggregates such as GDP, inflation, interest rates, etc. for virtually any country in the world. If, for a particular study that applies only to European countries, you define your population as the European countries, you will have a plethora of sources that give you reliable macroeconomic data for the whole population, not just a sample.

However, does the fact that you have access to such a cornucopia of data, even on the whole population, mean you can dispense with inferential statistics? No, because even if you have data on every individual in the population, you still do not know what the underlying data-generating process was. For example, if the true relationship between GDP and consumption is:

$$Y = \beta_0 + \beta_1C +u$$

where $u$ is not an error but a disturbance term, then even when you observe draws from the above generating process there will be some noise that makes it impossible to calculate the true population $\beta_0$ and $\beta_1$ directly.

Another way of thinking about it is that in this case the true population is not simply the population as you can observe it in the real world, but also the whole hypothetical population that would exist if you could observe infinite draws from the data-generating process. For example, you might be able to observe what the GDP and consumption of all European countries were in 2014 - but if those observations depend on a true relationship that has a random disturbance, you don't know whether the observations would be the same if you could rewind time to before 2014 and let 2014 play out again.
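
A minimal sketch of this idea (with made-up parameter values and a made-up number of countries standing in for 'all European countries'): even when every unit generated by the DGP above is observed, the disturbance means the fitted coefficients are only estimates, and a hypothetical replay of the same year yields different ones.

```python
# Simulate the *whole* population from Y = b0 + b1*C + u and fit OLS.
# Parameter values, noise level and the number of countries are hypothetical.
import numpy as np

b0_true, b1_true = 2.0, 0.8          # hypothetical "true" DGP parameters
n_countries = 44                     # made-up count standing in for "all European countries"

def observe_population(seed):
    """One realisation of the entire population from the data-generating process."""
    rng = np.random.default_rng(seed)
    C = rng.uniform(50, 500, size=n_countries)     # consumption
    u = rng.normal(scale=10.0, size=n_countries)   # disturbance term, not a measurement error
    Y = b0_true + b1_true * C + u                  # GDP, generated exactly as in the equation above
    slope, intercept = np.polyfit(C, Y, deg=1)     # OLS fit on the full population
    return intercept, slope

for seed in (0, 1, 2):                             # three hypothetical "replays" of the same year
    b0_hat, b1_hat = observe_population(seed)
    # The estimates differ from (2.0, 0.8) and change from replay to replay.
    print(f"replay {seed}: b0_hat={b0_hat:6.2f}, b1_hat={b1_hat:.3f}")
```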

Consequently, even if you could collect data on all European countries, and your study is supposed to be only about Europe so that it would be fine to define your population as just the European countries, you should still consider the data on all European countries for a particular year to be just one sample of all the data that could have been generated had the randomness in the relationship played out differently.

1muflon1
  • 905
  • 1
    Well put: the question implicitly assumes that big data = knowing the DGP. I'd go even further and say that statistics is at least just as important; after all, isn't it a frequentist's dream to have $n \to \infty$? :) And even if you have the maximum $n$, i.e. the whole "physical population", that is not necessarily the same as the "statistical population" (which is something abstract), since looking at the same variables tomorrow may show different values, which essentially would count as a different sample. You'd still need significance tests for doing best subset selection, e.g. – PaulG Dec 04 '20 at 11:58
  • Two points: 1) Re: "you still need statistical inference": for what do we need it? If we want to find the most likely parameters of your postulated theoretical model, then yes, but who says we want that? Even more: 2) Who says that the theoretical model (the linear one in your example) is correct at all? The point, at least to my understanding of the question, is whether big data can allow us to make decisions as good as or better than "statistical inference". I discussed that in my answer. – Igor F. Dec 04 '20 at 12:09
  • @IgorF. Those are valid points and I think your answer is good (I upvoted it). But statistics is not just a field in itself; it is also a tool used by other fields. I think it is fair to say that most scientists who practice statistics, without necessarily being pure statisticians, want to find the most likely parameters of some theoretical relationship. Also, the linear model was just an example to explain where I am coming from. In addition, in the answer above it is not my intent to claim that this makes statistics always necessary - I am trying to claim it is still important in general. – 1muflon1 Dec 04 '20 at 12:15