Questions tagged [large-data]

'Large data' refers to situations where the number of observations (data points) is so large that it necessitates changes in the way the data analyst thinks about or conducts the analysis. (Not to be confused with 'high dimensionality'.)

A sufficiently large number of observations may require changes in how an analysis is carried out, or in how its results are interpreted.

Some examples where the process may need to be adapted are:

  • special strategies may be required if there are more data than can fit in a computer's memory,
  • the analyst may need to pay attention to the computational efficiency of different optimization algorithms,
  • the analyst may need to consider how to visualize the data effectively, since standard plots (e.g., a scatterplot) may show only a large black spot of overlapping points.
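
As a minimal sketch of the first point, summary statistics can be computed in a single pass without holding all observations in memory. The example uses Welford's online update, which is one common choice and not something the tag description prescribes. The names `running_mean_variance` and `simulated_stream` are hypothetical, and the generator merely stands in for reading a large file record by record.

```python
from typing import Iterable, Iterator, Tuple


def running_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) in one pass via Welford's update."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance


def simulated_stream(n: int) -> Iterator[float]:
    """Hypothetical stand-in for a data source too large to load at once."""
    for i in range(n):
        yield float(i % 10)


# Only one observation is held in memory at a time.
count, mean, variance = running_mean_variance(simulated_stream(1_000_000))
print(count, mean, variance)
```

Because the function accepts any iterable, the same code would work on a file iterator or a database cursor, provided each item can be converted to a number.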

A common example of analysts conceptualizing an aspect of the process differently concerns statistical significance. With sufficient data, any nonzero difference, no matter how trivial, will be statistically 'significant'. For this reason, many analysts interpret a finding of significance in a very large data set differently than they would in a smaller one.
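
The effect of sample size on significance can be illustrated with a small calculation. This is a sketch under stated assumptions: a two-sample z-test with a known standard deviation of 1 in both groups, equal group sizes, and an observed mean difference fixed at 0.01 (idealized to be the same at every sample size). The function name `two_sided_p_value` is hypothetical.

```python
import math


def two_sided_p_value(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test with known, equal sd."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Assumption: the same trivial difference (0.01 sd) is observed at every n.
p_values = {n: two_sided_p_value(0.01, 1.0, n) for n in (100, 10_000, 1_000_000)}
for n, p in p_values.items():
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

Under these assumptions the p-value is far above 0.05 with 100 observations per group and far below it with one million, even though the size of the difference never changes.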

562 questions
46 votes, 10 answers

What exactly is Big Data?

I have been asked on several occasions the question: What is Big-Data? Both by students and my relatives that are picking up the buzz around statistics and ML. I found this CV-post. And I feel that I agree with the only answer there. The…
Gumeo • 3,711

6 votes, 1 answer

Wherefore Big Data?

It's been many years since I've done any statistics (or any serious math), but I do remember that the sampling error decreases more slowly for larger sample sizes (like n^(-1/2), at least for some statistics). I also remember (from numerical…
pron • 161

4 votes, 1 answer

Dealing with Big Data and Lots of Variables

What is a good technique to use on data that has many categorical variables with many possible values? For example, let's say you are trying to determine what kind of people are more likely to purchase again from your online store and you have…
Micro • 253

3 votes, 0 answers

Is this approach using the central-limit theorem applicable for reasoning about open source project data?

Definitions Open Source Project: In short (and roughly described, only to the purpose of clarification) open source projects allow me to have access to the code of a certain 'program'. Motivation and Background (so I can understand your answer) I…

1 vote, 0 answers

Statistical tools for tick data

I am working with tick data on some of the more liquid futures and options, which get millions of ticks every single day. Can I apply the normal statistical techniques (like correlation, regression, cointegration, etc.) to tick data due to the high…
nimbus3000 • 207 • 2 • 8