I am interested in studying the effect of increasing the number of data samples on the train error and test error of a regression model. For this I computed 95% confidence intervals of the error at different sample sizes. I found something that I couldn't understand and couldn't find an explanation for by searching the internet: the lower bound of the confidence interval of the test error stays constant as the number of samples increases.

The x-axis is the number of samples and the y-axis is the error

The blue line is the lower bound of the confidence interval and the green one is the higher bound
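
The kind of experiment described could be sketched as follows (toy data and a simple one-feature least-squares fit, both my own assumptions rather than the OP's actual model): refit at each sample size, measure test MSE over repeated draws, and form a normal-approximation 95% CI of the mean test error.

```python
import numpy as np

rng = np.random.default_rng(1)

def experiment(n, n_repeats=200, n_test=500):
    """95% CI of the mean test MSE of a linear fit on n training samples."""
    test_mses = []
    for _ in range(n_repeats):
        # Hypothetical data: y = 2x + Gaussian noise.
        x = rng.uniform(-1, 1, size=n)
        y = 2 * x + rng.normal(0, 0.5, size=n)
        w, b = np.polyfit(x, y, deg=1)  # least-squares slope and intercept
        x_t = rng.uniform(-1, 1, size=n_test)
        y_t = 2 * x_t + rng.normal(0, 0.5, size=n_test)
        test_mses.append(np.mean((w * x_t + b - y_t) ** 2))
    m = np.mean(test_mses)
    half = 1.96 * np.std(test_mses, ddof=1) / np.sqrt(n_repeats)
    return m - half, m + half

for n in [10, 40, 160]:
    lo, hi = experiment(n)
    print(f"n={n}: [{lo:.3f}, {hi:.3f}]")
```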


Galen

  • Your chart has two lines, both saying "Test" – Henry Nov 26 '21 at 00:32
  • Define what you mean by "error". What values do the y-axis tick marks represent? Is the blue line at error = zero? – user20637 Nov 27 '21 at 19:00
  • @Henry: OP says "The blue line is the lower bound of the confidence interval and the green one is the higher bound". I deduce that they both correspond to "test error". – user20637 Nov 27 '21 at 19:01
  • @user20637: that phrase was edited in after my comment. It answers my question, but raises new ones. – Henry Nov 27 '21 at 19:06

1 Answer


I'm squinting at the plot a little bit, but it appears that the lower bound is relatively constant compared to the upper bound; it may not be exactly constant, though. Here are a couple of options for checking this further:

  1. Try using a log scale on the vertical axis.
  2. Try plotting the two bounds in separate panels that share an x-axis.
import matplotlib.pyplot as plt

# x: the sample sizes; upper_bound_y / lower_bound_y: the CI bounds
fig, axes = plt.subplots(2, 1, sharex=True)
axes[0].plot(x, upper_bound_y)
axes[1].plot(x, lower_bound_y)

The above are leads for figuring out whether the lower bound is truly constant, or just relatively constant compared to the changes happening in the upper bound.
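
A quick numerical check, assuming the bounds are available as arrays (the values below are made up purely for illustration), is to compare how much each bound actually varies:

```python
import numpy as np

# Hypothetical CI bounds of the test error at increasing sample sizes.
lower_bound_y = np.array([0.50, 0.49, 0.50, 0.50, 0.49])
upper_bound_y = np.array([3.0, 2.1, 1.5, 1.1, 0.9])

# Spread (max - min) of each bound: a truly constant bound has spread 0.
lower_spread = lower_bound_y.max() - lower_bound_y.min()
upper_spread = upper_bound_y.max() - upper_bound_y.min()

print(round(lower_spread, 3), round(upper_spread, 3))  # prints: 0.01 2.1
```

Here the lower bound is not exactly constant, just nearly so relative to the upper bound.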

I imagine that you might like to know why the lower bound is at least relatively constant. Unfortunately I do not know precisely. I suspect it has to do with

  • the model already having been optimized to minimize error, which leads to this asymmetry,
  • combined with some bound on the predictive accuracy achievable with the given model and data.
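
As a toy illustration of that suspicion (my own sketch, not derived from the OP's data): squared errors are non-negative and right-skewed, so a percentile-bootstrap CI of the mean test error need not be symmetric, and its two bounds need not move at the same rate as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(errors, n_boot=2000):
    """Percentile-bootstrap 95% CI of the mean of `errors`."""
    n = len(errors)
    means = np.array([rng.choice(errors, size=n).mean() for _ in range(n_boot)])
    return np.percentile(means, [2.5, 97.5])

# Hypothetical per-sample squared errors: non-negative and right-skewed.
for n in [50, 200, 800]:
    errors = rng.normal(0.0, 1.0, size=n) ** 2
    lo, hi = bootstrap_ci(errors)
    print(f"n={n}: CI = [{lo:.2f}, {hi:.2f}]")
```

In this sketch the interval narrows mostly from above as n grows, which is at least qualitatively similar to what the plot shows.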
Galen