**Edit: (10/26/13) More clear (hopefully) mini-rewrites added at the bottom**

I'm asking this from a theoretical/general standpoint - not one that applies to a specific use case.

I was thinking about this today:

Assuming the data does not contain any measurement errors, if you're looking at a specific observation in your data and one of the measurements you recorded contains what could be considered an outlier, does this increase the probability (above that of the rest of the observations that do not contain measured outliers) that the same observation will contain another outlier in another measurement?

For my answer I'm looking for some sort of theorem, principle, etc. that states what I'm trying to communicate here much more elegantly. For clearer explanations see Gino and Behacad's answers.

Example:

Let's say you're measuring the height and circumference of a certain type of plant. Each observation corresponds to 1 plant (you're only doing this once).

For height, you measure:

Obs 1  |   10 cm
Obs 2  |   9 cm
Obs 3  |   11 cm
Obs 4  |   22 cm
Obs 5  |   10 cm
Obs 6  |   9 cm
Obs 7  |   11 cm
Obs 8  |   10 cm
Obs 9  |   11 cm
Obs 10  |   9 cm
Obs 11  |   11 cm
Obs 12  |   10 cm
Obs 13  |   9 cm
Obs 14  |   10 cm

Since observation 4 contains what could be considered an outlier from the rest of the data, would the probability increase that the measured circumference also contains an outlier for observation #4?

I understand my example may be too idealistic but I think it gets the point across...just change the measurements to anything.
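To make "what could be considered an outlier" concrete for the height column above, here is a minimal sketch using a robust rule (modified z-score based on the median and the median absolute deviation). The rule itself, the 0.6745 scaling constant, and the 3.5 threshold are assumptions chosen for illustration; the question does not commit to any particular definition, and a different rule could flag different observations.

```python
from statistics import median

# Heights (cm) for observations 1..14, copied from the table above.
heights_cm = [10, 9, 11, 22, 10, 9, 11, 10, 11, 9, 11, 10, 9, 10]


def flag_outliers(values, threshold=3.5):
    """Return 1-based observation numbers whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * (x - median) / MAD. Both the rule and
    the threshold are illustrative assumptions, not part of the question.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])  # median absolute deviation
    if mad == 0:
        return []  # the rule is undefined when more than half the values are identical
    return [
        i + 1
        for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


flagged = flag_outliers(heights_cm)
print(flagged)  # → [4]
```

Under this particular rule only observation 4 is flagged. Whether that tells you anything about the circumference of the same plant depends on how the two measurements are related, which the rule above does not address.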

Edited in attempts to make more clear:

(10/26/13)

Version 1 Attempt of Abbreviated Question:

In nature and in general, is there a tendency (even a weak one) that the probability is greater that the "degree of variance from the mean in any attribute(s) of an observation" will be similar to the "degree of variance from the mean in any other* specific attribute of that same observation" in comparison to the probability that it will INSTEAD be more similar to the "degree of variance from the mean in that same* specific attribute of any other observation."

* next to a word means I was pairing what they reference.
"Quotes" used above mean nothing and are used simply to help section parts together/off for clarity.

Version 2 Attempt of Abbreviated Question:

In nature and in general, is variance from the mean across observations for one attribute¹ correlated (even with extremely loose correlation) to the variance from the mean across observations for all attributes¹?

¹Attribute meaning measurement, quality, presence-of-either of these, and/or nearly anything else that the word "attribute" could even slightly represent as a word. Include all synonyms of the word "attribute" as well.

Taal
  • 3
    Unless you stipulate a definite probability model for your data and also define--in a specific and operational way--what "outlier" might mean, this question will not be answerable. – whuber Oct 23 '13 at 18:25
  • I did hesitate to use the word "outlier" as I didn't want to really throw a label (and one that is somewhat subjective) onto what I'm trying to explain - and especially one that may mislead. When you say a "definite probability model" could you elaborate? I'm open to any critiques of how I asked the question as I don't feel I did the best job explaining what I'm trying to ask. – Taal Oct 23 '13 at 18:31
  • 1
    If the variables are correlated in the tails, then you will have such tendency. However, since outliers are rare, there is no realistic empirical way to show such 'tail dependency'. – Michael M Oct 23 '13 at 18:43
  • 4
    I hardly know where to begin. Let's start with getting some context for the question and understanding the meanings of the terms you use. Precisely what do you mean by the "probability" of an outlier? Are you supposing a sequential experiment in which the chance of an "outlier" changes over time? If so, what is your model for this? Do you perhaps mean your estimated probability of an outlier? If the latter, what was your initial estimated probability (or basis to assess a change in probability)? How are you estimating the probability? How are you identifying "outliers"? – whuber Oct 23 '13 at 19:39
  • @whuber Whuber, I tried to answer each of your questions the best I can in the following text. Each of my answers is paired with a # I assigned to your question. The #s were assigned sequentially. – Taal Oct 23 '13 at 22:47
  • First off, it may be best to think of this in terms of "amount of variance" - Behacad's answer may help. – Taal Oct 23 '13 at 22:47
  • Answers:
    1. The likelihood that a specific measurement for a specific observation would be considered an outlier. Again, it may be best to think of this in terms of "amount of variance" instead of "outlier".

    2. We can add in the dimension of time, but I'm moreso looking for a general type of response that could be applicable to nearly any situation whether it be sequential or not.

    3. See previous answer, I'm not sure it matters if it's sequential or not as I'm looking for a more general response.

    – Taal Oct 23 '13 at 22:48
  • 4) Well it's more a comparison. However, you could assume that we have not actually determined a specific probability yet and thus it would be estimated. 5) It should not matter I would think as I'm only interested in the comparison of the probability of (Group A, Observations that contain an outlier & Group B, Observations that do not contain an outlier). 6) I was probably using the term "outliers" too loosely...the other answers may help clarify. Once again, I'm interested in the comparison of probability (is it safe to assume it's greater than the other?) - not an exact quantification. – Taal Oct 23 '13 at 22:49
  • I answered the last two questions as 1 question by the way. Also, thanks for your input whuber! – Taal Oct 23 '13 at 22:49
  • 2
    If the variables are independent, then no. If they're dependent, it depends on the form of dependence and how that interacts with your definition of what makes an outlier. – Glen_b Oct 24 '13 at 00:39
  • @Glen_b The main theme of the question is essentially that all variables are dependent to -some degree- no matter how small that degree is. Kind of like the "a butterfly causing a hurricane" zen outlook. In reality, I feel like categorizing things as such (independent, dependent) is actually, when you think deeply about it, somewhat of a misnomer...but the assumption is necessary in order to perform certain desired calculations. – Taal Oct 24 '13 at 01:06
  • 2
    If they're dependent, it's almost certain to change the probability, but it may be up or down and it may be a lot or a little; we can't say much of anything in the general case. – Glen_b Oct 24 '13 at 01:10
  • @Glen_b yes, if they're "dependent"...I may not be communicating my point very well. The two other answers may help to clarify. – Taal Oct 24 '13 at 01:13
  • 1
    It's perfectly possible to have a form of dependence that doesn't change the probability of an outlier. It's perfectly possible to have a form of dependence that doesn't change the variance. – Glen_b Oct 27 '13 at 02:06
  • @Glen_b Exactly. However, I already admit in my question all outcomes are obviously ___possible____. However, I'm wondering if there is something in the way nature/the universe works whereby an observation containing a measurement/etc. that is an "outlier"/"has-a-certain-amount-of variance-away-from-the-mean-to-where-it-could-potentially-'stand-out'" (this is obviously subjective...but shouldn't matter) has a higher probability (IN GENERAL - and the keyword here is "probability") of having another measurement/quality that also shares this same "attribute" in comparison to obsv that don't. – Taal Oct 27 '13 at 06:14
  • @glen_b I attempted to add some better explanations at the bottom of my question, let me know if they help at all – Taal Oct 27 '13 at 06:48
  • It's not really possible to generalize in such a way (outside of something like 'sometimes'); it would require us to have a special kind of highly comprehensive knowledge that we don't possess. (As Dr Lanning's hologram said in a movie, "You must ask the right questions.") – Glen_b Oct 27 '13 at 06:50
  • @Glen_b 1) It's not possible to prove this generalization as a mathematical law: I (at least now) completely agree. It's not "possible to generalize in such a way": I disagree - one can choose to throw labels onto an idea/concept and believe it is true despite the ability for said idea/concept to be empirically proven. I know what I'm asking is nearly (if not completely) impossible to prove, but this is StackExchange (most intelligent people I've ever encountered for the section they mostly represent) - so I had a glimmer of hope. Just pointing me to any relevant text is welcome too. – Taal Oct 27 '13 at 07:02
  • Glen, also, if you have any advice on how to make my question more clear then I'm more than happy to incorporate it in...admittedly I'm not the best at communicating things and thus may need help following your advice "You must ask the right questions." without actually changing what I'm asking. – Taal Oct 27 '13 at 07:06
  • 1
    I can't tell what you really need to ask. If I had better advice, you can be completely certain I'd let you know. And your assertion in your first response of the two above this one (the thing that you believe to be true) -- you're welcome to provide evidence of that, or some logical argument for it. In fact, I recommend you do, because it will likely reveal something important (such as an unstated assumption you have). – Glen_b Oct 27 '13 at 07:54
  • @glen_b I believe both/all parts of the comment/assertion to be...well true...- but that is only "me." Otherwise I wouldn't have written them :) I think you are referencing the part where I say ".... It's not 'possible to generalize in such a way': I disagree - one can...." though. In this, I'm not necessarily stating a mathematical/logical argument here...but more-so a semantic one, ha. One can generalize (you could use the word believe as a synonym here) anything they want. It's just that not everything everyone "generalizes" (or believes) is true. – Taal Oct 27 '13 at 08:27
  • ^Essentially I was trying to illustrate that because you (even though you have massive amount of credibility) say that this cannot be proven - or even if it actually is completely and irrevocably impossible to prove - it does not actually affect whether or not it actually is true. It merely affects our perceived likelihood that it is true (not to get too deep, but not even the simplest "laws" of science are 100%). It's the same concept as having faith in religion. It is not proven, but people still believe as if it were true, because no one can really and empirically prove otherwise. – Taal Oct 27 '13 at 08:29
  • ^Lol have to clarify that above and I'm getting way off topic. Basically, anything that humans believe is always true (even if it's gravity) is only our perception of such. However when you solidly prove something (like π, gravity, or the Pythagorean theorem) the likelihood it is true may as well be 100% and it's easier/makes more sense to just assume it is than taking into consideration the .00000001% chance it isn't. Back on topic though, I obviously can't prove it to be true or not true..but I keep feeling the tendency exists. I'm looking for any evidence that tilts my opinion either way. – Taal Oct 27 '13 at 08:44
  • I suggest we take it up in Ten Fold – Glen_b Oct 27 '13 at 10:16
  • 1
    The new questions are still puzzling. Version 2 looks most like "are variables are general correlated?" to which an empirical answer is simply yes. But the precise sense remains elusive. "Variance" and "correlation" have existing statistical meanings as collective properties of samples or populations; Taal seems to be using them to refer to individual observations and/or as if variance were a word that should just be read as another word for variation. Regarding all definitions as arguable much reduces the scope for definite statements. We still need a definition for probability. – Nick Cox Oct 27 '13 at 11:29
  • 1
    Should be "are variables in general correlated?" – Nick Cox Oct 27 '13 at 11:36
  • @NickCox Yes, admittedly version 2 was pretty bad and I worried someone would interpret it exactly as you just did...which really is my fault. What did you think of the first version?...or I'm assuming that's where your request for clarification on my "taal jargon" comes from, heh - although I don't feel my use of the words is too off. I will try to correct this and further clarify - join ten fold. I will say though that my use of the word "probability" means (as defined by Google) "the extent to which something is probable; the likelihood of something happening or being the case." ..more.. – Taal Oct 27 '13 at 11:37
  • As for an exact probability (or likelihood) it doesn't matter - all I'm interested in is if the probability is greater for an observation that has a data point that (I'll try some different words) is "uncommon in degree or intensity in comparison to other data points of the measurement for all observations". I can't quantify this - as it doesn't actually matter. If an observation has a data point with this quality, is there a greater chance (I'll quantify "chance" here arbitrarily as 4:10 odds) that it will have other data points with the same quality in comparison to obsv with datap without... – Taal Oct 27 '13 at 11:43
  • ^...this quality. So observations with a datapoint with this quality have a (completely made up quantification for example) 4 times out of 10 tries chance of having another datapoint from another of any measurement with this "outlier-ish" quality in comparison to observations that don't have any datapoints that are "uncommon in degree or intensity in comparison to other data points of the measurement for all observations." So for these "normal" observations with "average" (loosely used) datapoints for seemingly all of their datapoints their chances of having a datapoint is 3 out of 10.... – Taal Oct 27 '13 at 11:46
  • ^...assume you looked at all datapoints except for one just for this example's and explanation of my conjecture's sake. – Taal Oct 27 '13 at 11:49
  • On CV we can try to deal with definite technical questions. You seem to want to ask much looser philosophical questions. CV isn't a good forum for that. I can't see that we are going to make things much clearer for you. The definition of probability as the extent to which something is probable unfortunately does not help. People want you to give a formal statement defining probability mathematically for your example; if you don't want to do that, discussion here remains very difficult. – Nick Cox Oct 27 '13 at 11:49
  • @NickCox I'm only looking for this: "Is this probability greater than that probability?" I'll join ten fold in a second - we can chat there instead of me spamming comments here...heh. – Taal Oct 27 '13 at 11:50
  • Overlapping here, but the same point about definition of probability applies to invocations of "chance". I don't see that I am helping you, but I tried. – Nick Cox Oct 27 '13 at 11:51
  • @NickCox Lol, let's just chat in Ten Fold - this question has been bugging my mind for like the past week...not that I'm assuming it will be solved...just a point in the right direction or things like your last comment which are helpful :) – Taal Oct 27 '13 at 11:53
  • Yeah, give me some time and I'll try to study more deeply the responses by you and glen in the comments and attempt a rewrite or (at) you/glen with a comment hopefully where I feel I've read enough and confidently feel we're on the same page. I do appreciate your input though. – Taal Oct 27 '13 at 12:52
  • 1
    My biggest question is: "outlier on what"? If we specify a certain field (e.g., health), then I think we can find an answer. If it is just in general, then perhaps not. There are literally an infinite amount of variables associated with any observation. Imagine a study in which you measure a person's intelligence, the length of their toenails, how many particles are in the underwear they are wearing, and the number of letters in the first name of their great great great great grandmother. Being an outlier on any of these might not mean they will be on anything else...Thoughts? – Behacad Oct 31 '13 at 15:50
  • @Behacad #1 I feel very alone now Behacad! #2 You fell into the example trap #3 How does one specify what is contained in a certain field?...that's assuming somewhat obvious dependency first - I am assuming all things are dependent on some level (and this is, yes very esoteric) #4 This is almost like a philosophical question - perhaps I should ask it over there? #5 I view this as a tendency - meaning it won't always happen...but a pattern is recognizable if you're able to zoom out enough. #6 Also interested in other thoughts you have tho as you're the only other one who understands the most. – Taal Nov 04 '13 at 07:45
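Glen_b's points in the comments above (independent variables: no change; dependent variables: it depends on the form of dependence) can be illustrated with a small simulation. This is only a sketch under stated assumptions: both measurements are standard normal, "outlier" means more than 2 standard deviations from the mean, and the three dependence structures (`independent`, `correlated` with correlation 0.5, and `sign_linked`) are hypothetical choices made for illustration, not models of any real data.

```python
import math
import random


def outlier_rates(make_pair, n=200_000, cutoff=2.0, seed=0):
    """Estimate P(|Y| > cutoff) and P(|Y| > cutoff given |X| > cutoff) by simulation."""
    rng = random.Random(seed)
    x_out = y_out = both = 0
    for _ in range(n):
        x, y = make_pair(rng)
        x_flag = abs(x) > cutoff
        y_flag = abs(y) > cutoff
        x_out += x_flag
        y_out += y_flag
        both += x_flag and y_flag
    # Assumes at least one X outlier was drawn, which is safe for large n.
    return y_out / n, both / x_out


def independent(rng):
    # X and Y are unrelated standard normals.
    return rng.gauss(0, 1), rng.gauss(0, 1)


def correlated(rng, rho=0.5):
    # Bivariate normal with correlation rho (rho = 0.5 is an arbitrary choice).
    x = rng.gauss(0, 1)
    z = rng.gauss(0, 1)
    return x, rho * x + math.sqrt(1 - rho ** 2) * z


def sign_linked(rng):
    # Y always has the same sign as X, so the two are dependent,
    # but the size of Y is unrelated to the size of X.
    x = rng.gauss(0, 1)
    z = rng.gauss(0, 1)
    return x, math.copysign(abs(z), x)


results = {
    f.__name__: outlier_rates(f) for f in (independent, correlated, sign_linked)
}
for name, (unconditional, conditional) in results.items():
    print(f"{name:12s} P(Y out) = {unconditional:.3f}   P(Y out | X out) = {conditional:.3f}")
```

In the independent case the two estimates should roughly agree; with positive correlation the conditional rate comes out clearly higher; and in the sign-linked case the variables are dependent yet the outlier rate is essentially unchanged. None of this says which case holds "in nature and in general"; it only shows that the answer depends on the dependence structure and on the outlier definition, as the comments argue.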