Alternatives to scatterplot for visualisation for large samples

Question

I would like to visualise the following scatterplot in a different way that would make it more intuitive. The x-axis is trading frequency (how many trades were conducted), and the y-axis is the return on investment achieved with those transactions.

The sample is quite large and dominated by extreme values. Therefore I considered creating violin plots for each of the trading frequencies but that seemed a bit extreme.

Are there any other alternatives to this scatterplot that would allow me to visualise this more appealingly?

Since this is a "how to visualize my data" question, it would really benefit from including (a sample of) the data. – dipetkov, Sep 26 '22 at 17:52
An alternative to transforming the statistic on the y-axis could be to reformulate the statistic that you plot there. I suspect that ROI * Freq might work well on the y-axis. – Sextus Empiricus, Sep 27 '22 at 10:00

mkt · Accepted Answer · 2022-09-27T06:31:47.573

9

I would begin by considering a transformation: very likely to be useful for the y-axis, and possibly for the x-axis too. Use a log if there are no zeroes or negative values, a square root if there are no negative values (zeroes are fine), and a cube root if there are negative values.
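As a rough illustration of that rule of thumb, here is a minimal NumPy sketch. The helper name `transform` and the ROI values are made up for the example; they are not from the question's data.

```python
import numpy as np

def transform(values, kind="cbrt"):
    """Apply a log, square-root or cube-root transformation to an array."""
    values = np.asarray(values, dtype=float)
    if kind == "log":
        # log is only defined for strictly positive values
        if np.any(values <= 0):
            raise ValueError("log needs strictly positive values")
        return np.log(values)
    if kind == "sqrt":
        # square root tolerates zeroes but not negative values
        if np.any(values < 0):
            raise ValueError("square root needs non-negative values")
        return np.sqrt(values)
    if kind == "cbrt":
        # cube root is defined for negative, zero and positive values
        return np.cbrt(values)
    raise ValueError(f"unknown transformation: {kind}")

# Hypothetical ROI values in percent, including a total loss of -100
roi = np.array([-100.0, -8.0, 0.0, 27.0, 1000.0])
roi_cbrt = transform(roi, "cbrt")
```

The cube root keeps the sign of each value, so losses stay below zero and gains stay above it, and no constant has to be added before transforming.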

After transformation, you could consider:

A scatterplot with some jitter and possibly transparency (probably not great).
A violin or semi-violin plot.
A hexbin plot (quite likely the best option).
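A hexbin plot of the transformed data could look roughly like the sketch below, using matplotlib's `hexbin`. The data here is synthetic (the question's data was not posted), and the choice of a log transform for x, a cube root for y, and `gridsize=40` are assumptions to adjust for the real data.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to view the plot interactively
import matplotlib.pyplot as plt

# Synthetic stand-in for the real data: trade counts and heavy-tailed ROI
rng = np.random.default_rng(0)
n = 50_000
freq = rng.geometric(p=0.05, size=n)       # integer trade counts, always >= 1
roi = rng.standard_t(df=2, size=n) * 20.0  # ROI in percent with extreme values

fig, ax = plt.subplots()
hb = ax.hexbin(
    np.log10(freq),   # log is safe here because freq has no zeroes
    np.cbrt(roi),     # cube root handles negative ROI
    gridsize=40,      # number of hexagons across the x direction
    bins="log",       # colour by log of the count so sparse regions stay visible
    mincnt=1,         # leave empty hexagons blank
)
fig.colorbar(hb, ax=ax, label="points per hexagon (log colour scale)")
ax.set_xlabel("log10(trading frequency)")
ax.set_ylabel("cube root of ROI")
fig.savefig("hexbin.png", dpi=150)
```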

edited Sep 27 '22 at 06:31

answered Sep 26 '22 at 14:36


I transformed both x and y; y had negative values down to -100, so I added 100 before transforming. Is the plot then still interpretable? – magisterludi Sep 26 '22 at 15:07

@magisterludi Probably best to use a cube root transformation then. I've updated the answer to reflect this. – mkt Sep 26 '22 at 15:13

score 6 · Answer 2 · answered Sep 26 '22 at 14:37


You could indicate the density of points with a heatmap, e.g., using a black-body radiation palette. Alternatively, use a grayscale palette. Or use a hexbin plot, which pretty much does the grayscaling for you.
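One way to sketch such a density heatmap is to count points on a rectangular grid with `np.histogram2d` and draw the counts in grayscale. The data below is synthetic, and the transformations and the 50 x 50 grid are assumptions. Swapping `cmap="gray_r"` for matplotlib's `"hot"` colormap gives a black-red-yellow-white scale that roughly resembles a black-body palette.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to view the plot interactively
import matplotlib.pyplot as plt

# Synthetic stand-in for the real data
rng = np.random.default_rng(1)
n = 50_000
freq = rng.geometric(p=0.05, size=n)       # integer trade counts, always >= 1
roi = rng.standard_t(df=2, size=n) * 20.0  # heavy-tailed ROI in percent

# Count the points falling into each cell of a 50 x 50 grid
counts, x_edges, y_edges = np.histogram2d(np.log10(freq), np.cbrt(roi), bins=50)

fig, ax = plt.subplots()
# histogram2d indexes counts as [x, y]; pcolormesh expects rows to follow y, hence .T
# log1p compresses the range so sparsely populated cells remain distinguishable
mesh = ax.pcolormesh(x_edges, y_edges, np.log1p(counts).T, cmap="gray_r")
fig.colorbar(mesh, ax=ax, label="log(1 + points per cell)")
ax.set_xlabel("log10(trading frequency)")
ax.set_ylabel("cube root of ROI")
fig.savefig("density_heatmap.png", dpi=150)
```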


Stephan Kolassa


Banach · Answer 3 · 2022-09-27T09:15:06.430

Binscatter plots are designed for this exact problem: visualizing two-way relationships in huge datasets. See here for R and Python packages and some references.

The name says it all: the binscatter procedure partitions the data domain into bins and plots only the sample average within each bin rather than every data point.
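To make the idea concrete, here is a bare-bones sketch of the binning step in NumPy. It is a simplified illustration, not the implementation used by the linked packages, which do considerably more. The function name `binscatter`, the use of quantile-spaced bins, and the default of 20 bins are assumptions for the example.

```python
import numpy as np

def binscatter(x, y, n_bins=20):
    """Return the mean of x and the mean of y within quantile-spaced bins of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Quantile-spaced edges give bins with roughly equal numbers of points
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    # Using only the interior edges yields bin indices 0 .. n_bins - 1
    idx = np.searchsorted(edges[1:-1], x, side="right")
    x_means, y_means = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # heavily tied x values can leave some bins empty
            x_means.append(x[mask].mean())
            y_means.append(y[mask].mean())
    return np.array(x_means), np.array(y_means)

# Synthetic example: noisy linear relationship with heavy-tailed noise
rng = np.random.default_rng(2)
x_demo = rng.uniform(0.0, 10.0, size=100_000)
y_demo = 3.0 * x_demo + rng.standard_t(df=2, size=x_demo.size) * 10.0
bin_x, bin_y = binscatter(x_demo, y_demo, n_bins=20)
```

Plotting `bin_y` against `bin_x` as an ordinary scatterplot then shows 20 points instead of 100,000, with most of the noise averaged out.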

The main advantage of binscatter is the ease with which it handles additional covariates, which, roughly speaking, lets you visualize n-way relationships. See here for an overview and some examples.
