What is the relationship between $Y$ and $X$ in the following plot? In my view there is a negative linear relationship, but because we have a lot of outliers, the relationship is very weak. Am I right? I want to learn how we can explain scatterplots.

The question deals with several concepts: how to evaluate data given only in the form of a scatterplot, how to summarize a scatterplot, and whether (and to what degree) a relationship looks linear. Let's take them in order.
Use principles of exploratory data analysis (EDA). These (at least originally, when they were developed for pencil-and-paper use) emphasize simple, easy-to-compute, robust summaries of data. One of the very simplest kinds of summaries is based on positions within a set of numbers, such as the middle value, which describes a "typical" value. Middles are easy to estimate reliably from graphics.
Scatterplots exhibit pairs of numbers. The first of each pair (as plotted on the horizontal axis) gives a set of single numbers, which we could summarize separately.
In this particular scatterplot, the y-values appear to lie within two almost completely separate groups: the values above $60$ at the top and those equal to or less than $60$ at the bottom. (This impression is confirmed by drawing a histogram of the y-values, which is sharply bimodal, but that would be a lot of work at this stage.) I invite sceptics to squint at the scatterplot. When I do--using a large-radius, gamma-corrected Gaussian blur (that is, a standard, rapid image-processing operation) of the dots in the scatterplot--I see this:

The two groups--upper and lower--are pretty apparent. (The upper group is much lighter than the lower because it contains many fewer dots.)
Accordingly, let's summarize the groups of y-values separately. I will do that by drawing horizontal lines at the medians of the two groups. In order to emphasize the impression of the data and to show we're not doing any kind of computation, I have (a) removed all decorations like axes and gridlines and (b) blurred the points. Little information about the patterns in the data is lost by thus "squinting" at the graphic:

Similarly, I have attempted to mark the medians of the x-values with vertical line segments. In the upper group (red lines) you can check--by counting the blobs--that these lines do actually separate the group into two equal halves, both horizontally and vertically. In the lower group (blue lines) I have only visually estimated the positions without actually doing any counting.
The points of intersection are the centers of the two groups. One excellent summary of the relationship among the x and y values would be to report these central positions. One would then want to supplement this summary by a description of how much the data are spread in each group--to the left and right, above and below--around their centers. For brevity, I won't do that here, but note that (roughly) the lengths of the line segments I have drawn reflect the overall spreads of each group.
Finally, I drew a (dashed) line connecting the two centers. This is a reasonable regression line. Is it a good description of the data? Certainly not: look how spread out the data are around this line. Is it even evidence of linearity? That's scarcely relevant because the linear description is so poor. Nevertheless, because that is the question before us, let's address it.
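For readers who want to reproduce this construction numerically, here is a minimal sketch in R. It assumes the data scraped in the answers below (with columns named x and y) and the visual split of the y-values at 60; it illustrates the idea rather than reproducing the graphics above.

# Median-center construction: split at y = 60, find each group's center
# by coordinate-wise medians, and connect the centers with a dashed line.
dat <- read.csv("http://doyenne.com/personal/files/data.csv")
hi <- subset(dat, y > 60)
lo <- subset(dat, y <= 60)
center_hi <- c(median(hi$x), median(hi$y))
center_lo <- c(median(lo$x), median(lo$y))
plot(dat$x, dat$y, col = "grey", xlab = "x", ylab = "y")
points(c(center_hi[1], center_lo[1]), c(center_hi[2], center_lo[2]),
       pch = 19, col = c("red", "blue"))
segments(center_hi[1], center_hi[2], center_lo[1], center_lo[2], lty = 2)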
A relationship is linear in a statistical sense when either the y values vary in a balanced random fashion around a line or the x values are seen to vary in a balanced random fashion around a line (or both).
The former does not appear to be the case here: because the y values seem to fall into two groups, their variation is never going to look balanced in the sense of being roughly symmetrically distributed above or below the line. (That immediately rules out the possibility of dumping the data into a linear regression package and performing a least squares fit of y against x: the answers would not be relevant.)
What about variation in x? That is more plausible: at each height on the plot, the horizontal scatter of points around the dotted line is pretty balanced. The spread in this scatter seems to be a little bit greater at lower heights (low y values), but maybe that's because there are many more points there. (The more random data you have, the wider apart their extreme values will tend to be.)
Moreover, as we scan from top to bottom, there are no places where the horizontal scatter around the regression line is strongly unbalanced: that would be evidence of non-linearity. (Well, maybe around y=50 or so there may be too many large x values. This subtle effect could be taken as further evidence for breaking the data into two groups around the y=60 value.)
We have seen that:

1. It makes sense to view x as a linear function of y plus some "nice" random variation.
2. It does not make sense to view y as a linear function of x plus random variation.
3. A regression line can be estimated by separating the data into a group of high y values and a group of low y values, finding the centers of both groups by using medians, and connecting those centers.
4. The resulting line has a downward slope, indicating a negative linear relationship.
5. There are no strong departures from linearity.
6. Nevertheless, because the spreads of the x-values around the line are still large (compared to the overall spread of the x-values to begin with), we would have to characterize this negative linear relationship as "very weak."
It might be more useful to describe the data as forming two oval-shaped clouds (one for y above 60 and another for lower values of y). Within each cloud there is little detectable relationship between x and y. The centers of the clouds are near (0.29, 90) and (0.38, 30). The clouds have comparable spreads, but the upper cloud has far fewer data than the lower one (maybe 20% as many).
Two of these conclusions confirm those made in the question itself: that there is a weak, negative relationship. The others supplement and support those conclusions.
One conclusion drawn in the question that does not seem to hold up is the assertion that there are "outliers." A more careful examination (as sketched below) will fail to turn up any individual points, or even small groups of points, that validly could be considered outlying. After sufficiently long analysis, one's attention might be drawn to the two points near the middle right or the one point at the lower left corner, but even these are not going to change one's assessment of the data very much, whether or not they are considered outlying.
Much more could be said. The next steps would be to assess the spreads of those clouds. The relationships between x and y within each of the two clouds could be evaluated separately, using the same techniques shown here. The slight asymmetry of the lower cloud (more data seem to appear at the smallest y values) could be evaluated and even adjusted by re-expressing the y values (a square root might work well). At this stage it would make sense to look for outlying data, because at this point the description would include information about typical data values as well as their spreads; outliers (by definition) would be too far from the middle to be explained in terms of the observed amount of spreading.
None of this work--which is quite quantitative--requires much more than finding middles of groups of data and doing some simple computations with them, and therefore can be done quickly and accurately even when the data are available only in graphical form. Every result reported here--including the quantitative values--could easily be found within a few seconds using a display system (such as hardcopy and a pencil :-)) which permits one to make light marks on top of the graphic.
Let's have some fun!
First of all, I scraped the data off your graph.
Then I used a running-line smoother to produce the black regression line below, with dashed 95% CI bands in gray. The graph uses a smoothing span of one half of the data, although tighter spans revealed more or less precisely the same relationship. The slight change in slope around $X=0.4$ suggested a relationship that could be approximated using a linear model plus a linear hinge function of $X$, fit with nonlinear least squares regression (red line):
$$Y = \beta_{0} + \beta_{X}X + \beta_{\text{c}}\max\left(X-\theta,0\right) + \varepsilon$$
The coefficient estimates were:
$$Y = 50.9 - 37.7X - 26.7\max\left(X-0.46,\,0\right)$$
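A minimal sketch of such a fit in R, assuming the scraped data linked in gung's answer below (columns x and y); the starting values are my guesses based on the estimates just reported:

# Fit Y = b0 + b1*X + b2*max(X - theta, 0) by nonlinear least squares.
dat <- read.csv("http://doyenne.com/personal/files/data.csv")
fit <- nls(y ~ b0 + b1 * x + b2 * pmax(x - theta, 0),
           data = dat,
           start = list(b0 = 50, b1 = -40, b2 = -25, theta = 0.45))
summary(fit)  # coefficient estimates and standard errors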
I would note that while the redoubtable whuber asserts that there are no strong departures from linearity, the deviation from the line $Y = 50.9 - 37.7X$ implied by the hinge term is on the same order as the slope of $X$ (i.e., 37.7). So I would respectfully disagree: yes, there are no strong relationships, but the nonlinear term is about as strong as the linear one.

Interpretation
(I have proceeded assuming that you are only interested in $Y$ as the dependent variable.) Values of $Y$ are very weakly predicted by $X$ (with an adjusted $R^{2}$ of 0.03). The association is approximately linear, with a slight decrease in slope at about 0.46. The residuals are somewhat skewed to the right, probably because there is a sharp lower bound on the values of $Y$. Given the sample size $N=170$, I am inclined to tolerate the violations of normality. More observations for values of $X>0.5$ would help nail down whether the change in slope is real, or is an artifact of decreased variance of $Y$ in that range.
Updating with the $\ln(Y)$ graph:
(The red line is simply a linear regression of ln(Y) on X.)

In comments Russ Lenth wrote: "I just wonder if this holds up if you smooth $\log Y$ vs. $X$. The distribution of $Y$ is skewed right." This is quite a good suggestion: the $\log Y$ transform versus $X$ also gives a slightly better fit than a line between $Y$ and $X$, with residuals that are more symmetrically distributed. However, both his suggested $\log(Y)$ transform and my linear hinge of $X$ share a preference for a relationship between the (untransformed) $Y$ and $X$ that is not described by a straight line.
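A minimal sketch of that residual comparison in R, again assuming the scraped data with columns x and y:

# Compare the residual distributions of the plain and log-scale linear fits.
dat <- read.csv("http://doyenne.com/personal/files/data.csv")
fit_lin <- lm(y ~ x, data = dat)
fit_log <- lm(log(y) ~ x, data = dat)
op <- par(mfrow = c(1, 2))
hist(resid(fit_lin), main = "Residuals: Y ~ X")
hist(resid(fit_log), main = "Residuals: log(Y) ~ X")
par(op)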
Here's my 2¢ (make that 1.5¢). To me the most prominent feature is that the data abruptly stop and 'bunch up' at the bottom of the range of Y. I do see the two (potential) 'clusters' and the general negative association, but the most salient features are the (potential) floor effect and the fact that the top, low-density cluster extends across only part of the range of X.
Because the 'clusters' are vaguely bivariate normal, a parametric normal mixture model may be interesting to try. Using @Alexis' data, I find that three clusters optimize the BIC. The high-density 'floor effect' is picked out as a third cluster. The code follows:
library(mclust)   # Gaussian finite mixture models fit by EM
# Read the scraped data (columns x and y)
dframe = read.table(url("http://doyenne.com/personal/files/data.csv"),
                    header = TRUE, sep = ",")
mc = Mclust(dframe)   # fits a range of models/components; selects by BIC
summary(mc)
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust VVI (diagonal, varying volume and shape) model with 3 components:
#
# log.likelihood n df BIC ICL
# -614.4713 170 14 -1300.844 -1338.715
#
# Clustering table:
# 1 2 3
# 72 72 26

Now, what shall we infer from this? I do not think that Mclust is merely human pattern recognition gone awry. (Whereas my reading of the scatterplot may well be.) On the other hand, there is no question that this is post hoc: I saw what I thought might be an interesting pattern and decided to check it. The algorithm does find something, but then I only checked for what I thought might be there, so my thumb is definitely on the scale. Sometimes it is possible to devise a strategy to mitigate this (see @whuber's excellent answer here), but I have no idea how to go about such a process in cases like this. As a result, I take these results with a lot of salt (I've done this sort of thing sufficiently often that someone is missing a whole shaker).

It does give me some material to think about and discuss with my client when next we meet. What are these data? Does it make any sense that there could be a floor effect? Would it make sense that there could be different groups? How meaningful / surprising / interesting / important would it be if these were real? Do independent data exist, or could we get them conveniently, to perform an honest test of these possibilities? Etc.
Let me describe what I see as soon as I look at it:
If we're interested in the conditional distribution of $y$ (which is often where interest focuses if we see $x$ as the IV and $y$ as the DV), then for $x\leq 0.5$ the conditional distribution of $Y|x$ appears bimodal, with an upper group (between about 70 and 125, with mean a bit below 100) and a lower group (between 0 and about 70, with mean around 30 or so). Within each modal group, the relationship with $x$ is nearly flat. (See the red and blue lines below, drawn roughly where I judge the locations of the two groups to be.)
Then by looking at where those two groups are more or less dense in $X$, we can go on to say more:
For $x>0.5$ the upper group disappears completely, which makes the overall mean of $y$ fall there; and below about $x=0.2$, the lower group is much less dense than it is above, making the overall average of $y$ higher.
Between them, these two effects induce an apparent negative (but nonlinear) relationship between the two variables: $E(Y|X=x)$ seems to decrease with $x$, but with a broad, mostly flat region in the center. (See the purple dashed line.)

No doubt it would be important to know what $Y$ and $X$ were, because then it might be clearer why the conditional distribution for $Y$ might be bimodal over much of its range (indeed, it might even become clear that there are indeed two groups, whose distributions in $X$ induce the apparent decreasing relationship in $Y|x$).
This is what I saw based on purely "by-eye" inspection. With a bit of playing around in something like a basic image-manipulation program (like the one I drew the lines with), we could start to figure out some more accurate numbers. If we digitize the data (which is pretty simple with decent tools, if sometimes a little tedious to get right), then we can undertake more sophisticated analyses of that sort of impression.
This kind of exploratory analysis can lead to some important questions (sometimes ones that surprise the person who has the data but has only shown a plot), but we must take some care over the extent to which our models are chosen by such inspections: if we choose models on the basis of the appearance of a plot and then estimate those models on the same data, we'll tend to encounter the same problems we get when we use more formal model selection and estimation on the same data. [This is not to deny the importance of exploratory analysis at all; it's just that we must be careful of the consequences of doing it without regard to how we go about it.]
Response to Russ' comments:
[later edit: To clarify -- I broadly agree with Russ' criticisms taken as a general precaution, and there's certainly some possibility I've seen more than is really there. I plan to come back and edit these into a more extensive commentary on spurious patterns we commonly identify by eye and ways we might start to avoid the worst of that. I believe I'll also be able to add some justification about why I think it's probably not just spurious in this specific case (e.g. via a regressogram or 0-order kernel smooth, though of course, absent more data to test against, there's only so far that can go; for example, if our sample is unrepresentative, even resampling only gets us so far.]
I completely agree we have a tendency to see spurious patterns; it's a point I make frequently both here and elsewhere.
One thing I suggest, for example, when looking at residual plots or Q-Q plots is to generate many plots where the situation is known (both as things should be and where assumptions don't hold) to get a clear idea how much pattern should be ignored.
Here's an example where a Q-Q plot is placed among 24 others (which satisfy the assumptions), in order for us to see how unusual the plot is. This kind of exercise is important because it helps us avoid fooling ourselves by interpreting every little wiggle, most of which will be simple noise.
I often point out that if you can change an impression by covering a few points, we may be relying on an impression generated by nothing more than noise.
[However, when it's apparent from many points rather than few, it's harder to maintain that it's not there.]
The displays in whuber's answer support my impression; the Gaussian-blur plot seems to pick up the same tendency to bimodality in $Y$.
When we don't have more data to check, we can at least look at whether the impression tends to survive resampling (bootstrap the bivariate distribution and see if it's nearly always still present), or other manipulations where the impression shouldn't be apparent if it's simple noise.
1) Here's one way to see if the apparent bimodality is more than just skewness plus noise - does it show up in a kernel density estimate? Is it still visible if we plot kernel density estimates under a variety of transformations? Here I transform it toward greater symmetry, at 85% of default bandwidth (since we're trying to identify a relatively small mode, and the default bandwidth is not optimized for that task):

The plots are of $Y$, $\sqrt{Y}$ and $\log(Y)$. The vertical lines are at $68$, $\sqrt{68}$ and $\log(68)$. The bimodality is diminished but still quite visible. Since it is very clear in the original KDE, this seems to confirm that it is there, and the second and third plots suggest it is at least somewhat robust to transformation.
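A minimal sketch of these density estimates in R (the data URL and column names are assumptions carried over from the other answers):

# Kernel density estimates of Y under three transformations,
# at 85% of the default bandwidth (adjust = 0.85).
dat <- read.csv("http://doyenne.com/personal/files/data.csv")
op <- par(mfrow = c(1, 3))
plot(density(dat$y, adjust = 0.85), main = "Y")
abline(v = 68, lty = 2)
plot(density(sqrt(dat$y), adjust = 0.85), main = "sqrt(Y)")
abline(v = sqrt(68), lty = 2)
plot(density(log(dat$y), adjust = 0.85), main = "log(Y)")
abline(v = log(68), lty = 2)
par(op)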
2) Here's another basic way to see if it's more than just "noise":
Step 1: perform clustering on Y

Step 2: Split the data into two groups on $X$, cluster the two groups separately, and see whether the results are similar. If nothing real is going on, the two halves shouldn't be expected to split in much the same way.

The points marked with dots were clustered differently from the "all in one set" clustering in the previous plot. I'll do some more later, but it seems there really might be a horizontal "split" near that position.
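Here is a minimal sketch of that split-and-recluster check in R. It uses k-means on the y values as a stand-in for whatever clustering was used above (the answer does not specify), with the usual assumed data URL and column names:

# Cluster y into two groups, overall and within each half of x,
# and compare the implied group means.
dat <- read.csv("http://doyenne.com/personal/files/data.csv")
set.seed(1)  # k-means uses random starts
grp_means <- function(y) sort(kmeans(y, centers = 2)$centers)
grp_means(dat$y)           # all data
left <- dat$x <= median(dat$x)
grp_means(dat$y[left])     # low-x half
grp_means(dat$y[!left])    # high-x half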
I'm going to try a regressogram or Nadaraya-Watson estimator (both being local estimates of the regression function, $E(Y|x)$). I haven't generated either yet, but we'll see how they go. I'd probably exclude the very ends where there's little data.
3) Edit: Here's the regressogram, for bins of width 0.1 (excluding the very ends, as I suggested earlier):

This is entirely consistent with the original impression I had of the plot; it doesn't prove my reasoning was correct, but my conclusions agree with what the regressogram shows.
If what I saw in the plot - and the resulting reasoning - was spurious, I probably should not have succeeded at discerning $E(Y|x)$ like this.
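A minimal regressogram sketch in R (the bin edges are my assumption about the plotted range of x; the data URL and column names are carried over from the other answers):

# Regressogram: mean of y within bins of x of width 0.1.
dat <- read.csv("http://doyenne.com/personal/files/data.csv")
brks <- seq(0, 0.7, by = 0.1)
bins <- cut(dat$x, breaks = brks)
means <- tapply(dat$y, bins, mean)
mids <- brks[-length(brks)] + 0.05
ok <- !is.na(means)   # drop empty bins at the sparse ends
plot(dat$x, dat$y, col = "grey", xlab = "x", ylab = "y")
lines(mids[ok], means[ok], type = "b", pch = 19)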
(The next thing to try would be a Nadaraya-Watson estimator. Then I might see how it goes under resampling, if I have time.)
4) Later edit:
Nadaraya-Watson, Gaussian kernel, bandwidth 0.15:

Again, this is surprisingly consistent with my initial impression. Here are the NW estimators based on ten bootstrap resamples:

The broad pattern is there, though a couple of the resamples don't follow the whole-data description as clearly. We see that the level on the left is less certain than on the right: the noise (partly from having few observations, partly from the wide spread) is such that it's harder to claim the mean is really higher at the left than at the center.
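A minimal version of this in R uses ksmooth() as the Nadaraya-Watson estimator. Note that ksmooth() scales its bandwidth argument so the kernel quartiles fall at ±0.25 × bandwidth, so "bandwidth = 0.15" here is not exactly the 0.15 used above; treat this as a sketch:

# Nadaraya-Watson (Gaussian kernel) fit plus ten bootstrap resamples.
dat <- read.csv("http://doyenne.com/personal/files/data.csv")
plot(dat$x, dat$y, col = "grey", xlab = "x", ylab = "y")
set.seed(1)
for (i in 1:10) {
  idx <- sample(nrow(dat), replace = TRUE)  # resample (x, y) pairs
  lines(ksmooth(dat$x[idx], dat$y[idx], kernel = "normal", bandwidth = 0.15),
        col = adjustcolor("blue", 0.3))
}
lines(ksmooth(dat$x, dat$y, kernel = "normal", bandwidth = 0.15), lwd = 2)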
My overall impression is that I probably wasn't simply fooling myself, because the various aspects stand up moderately well to a variety of challenges (smoothing, transformation, splitting into subgroups, resampling) that would tend to obscure them if they were simply noise. On the other hand, the indications are that the effects, while broadly consistent with my initial impression, are relatively weak, and it may be too much to claim any real change in expectation moving from the left side to the center.
OK folks, I followed Alexis's lead and captured the data. Here is a plot of $\log y$ versus $x$. 
And the correlations:
> cor.test(~ x + y, data = data)
Pearson's product-moment correlation
data: x and y
t = -2.6311, df = 169, p-value = 0.009298
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.33836844 -0.04977867
sample estimates:
cor
-0.1983692
> cor.test(~ x + log(y), data = data)
Pearson's product-moment correlation
data: x and log(y)
t = -2.8901, df = 169, p-value = 0.004356
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.35551268 -0.06920015
sample estimates:
cor
-0.2170188
The correlation test does indicate a likely negative dependence. I remain unconvinced of any bimodality (but also unconvinced that it's absent).
[I removed a residual plot I had in an earlier version because I overlooked the point that @whuber was trying to predict $X|Y$.]
Russ Lenth wondered how the graph would look if the Y axis were logarithmic. Alexis scraped the data, so it is easy to plot it with a log axis:

On a log scale, there is no hint of bimodality or trend. Whether a log scale makes sense or not depends, of course, on the details of what the data represent. Similarly, whether it makes sense to think that the data represent sampling from two populations as whuber suggests depends on the details.
Addendum: Based on the comments below, here is a revised version:

Well, you are right: the relationship is weak, but not zero. However, don't guess; run a simple linear regression (OLS) and find out. You will get a slope of xxx, which tells you what the relationship is. And yes, you do have outliers that might bias the results. That can be dealt with: you could use Cook's distance or create a leverage plot to estimate the outliers' effect on the relationship.
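A minimal sketch of those checks in R (data URL and column names assumed, as in the other answers):

# OLS fit, Cook's distances, and a residuals-vs-leverage plot.
dat <- read.csv("http://doyenne.com/personal/files/data.csv")
fit <- lm(y ~ x, data = dat)
coef(fit)                          # the slope estimates the relationship
plot(cooks.distance(fit), type = "h",
     ylab = "Cook's distance")     # influence of each observation
plot(fit, which = 5)               # residuals vs. leverage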
Good luck
You already provided some intuition in your question by looking at the orientation of the X/Y data points and their dispersion. In short, you're correct.
In formal terms, the orientation can be referred to as the sign of the correlation, and the dispersion as the variance. These two links will give you more information on how to interpret the linear relationship between two variables.
This is homework, so the answer to your question is simple. Run a linear regression of Y on X, and you'll get something like this:
      Coefficient    Std Err        t Stat
C     53.14404163     6.522516463    8.147781908
X    -44.8798926     16.80565866    -2.670522684
So, the t-statistic on the X variable is significant at the 99% confidence level. Hence, you can declare that the variables have some kind of relationship.
Is it linear? Add a variable X2 = (X-mean(X))^2, and regress again.
      Coefficient    Std Err        t Stat
C     53.46173893     6.58938281     8.11331508
X    -43.9503443     17.01532569    -2.582985779
X2   -44.601130     114.1461801    -0.390736951
The coefficient on X is still significant, but X2 is not. X2 represents nonlinearity. So, you declare that the relationship appears to be linear.
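A minimal sketch of both regressions in R, using the assumed data URL and column names from elsewhere in this thread (so the exact numbers may differ from the output above):

# Step 1: linear fit; Step 2: add the de-meaned, squared term X2.
dat <- read.csv("http://doyenne.com/personal/files/data.csv")
fit1 <- lm(y ~ x, data = dat)
fit2 <- lm(y ~ x + I((x - mean(x))^2), data = dat)
summary(fit1)   # check the t statistic on x
summary(fit2)   # check the t statistic on the quadratic (nonlinearity) term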
The above was for the homework.
In real life, things are more complicated. Imagine that this was data on a class of students: Y is bench press in pounds, and X is time in minutes holding one's breath before the bench press. I'd ask for the gender of the students. Just for the fun of it, let's add another variable Z, and say that Z=1 (girls) for all Y<60 and Z=0 (boys) when Y>=60. Run the regression with three variables:
      Coefficient    Std Err        t Stat
C     92.93031357     3.877092841   23.969071
X     -6.55246715     8.977138488   -0.72990599
X2   -43.6291362     59.06955097    -0.738606194
Z    -63.3231270      2.960160265  -21.39179009
What happened?! The "relationship" between X and Y has disappeared! Oh, it seems that the relationship was spurious, due to the confounding variable: gender.
What is the moral of the story? You need to know what the data are in order to "explain" the "relationship", or even to establish it in the first place. In this case, the moment I'm told that the data are on students' physical activity, I'll immediately ask for their gender, and I won't even bother analyzing the data without the gender variable.
On the other hand, if you're asked to "describe" the scatterplot, then anything goes: correlations, linear fits, etc. For your homework, the first two steps above should be sufficient: look at the coefficient on X (relationship), then on X2 (linearity). Make sure you de-mean the X variable (subtract its mean).
What is $Y$?
What process do you think produced the outliers? What makes you think that they are not real measurements? What is the theory?
– abaumann Sep 07 '14 at 15:47