I'm working on a regression problem in which I am trying to predict wine quality from the wine's other attributes. (The quality value is the median of the ratings given by 3 wine-tasting experts, each scoring the wine out of 10.)

My problem: after fitting a linear regression with sklearn using alcohol as the only predictor, the coefficient of determination (R²) was only 0.2. To improve this:

  • I have tried multiple linear regression with several other variables (volatile acidity, density, etc.), but the highest correlation I can get is only 0.27.
  • I have tried standardising the data and removing outliers. The steps I took were: a) taking the natural log, b) computing the z-score, c) removing points outside 1.5*IQR (Tukey's fences).
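
For readers who want to reproduce the setup, here is a minimal sketch of what the steps above might look like in sklearn. It is not the asker's actual code: the data is synthetic, the column names (`alcohol`, `volatile_acidity`, `density`, `quality`) are assumptions, and the helper `r2_on_holdout` is hypothetical. It assumes all features are strictly positive so the log is defined.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data (NOT the real wine data); column names are assumed.
df = pd.DataFrame({
    "alcohol": rng.uniform(8.0, 14.0, n),
    "volatile_acidity": rng.lognormal(-0.7, 0.3, n),
    "density": rng.normal(0.997, 0.002, n),
})
noise = rng.normal(0.0, 0.8, n)
raw_quality = 3.0 + 0.3 * df["alcohol"] - 1.0 * df["volatile_acidity"] + noise
df["quality"] = np.clip(np.round(raw_quality), 0, 10)  # integer-like 0-10 score

features = ["alcohol", "volatile_acidity", "density"]

# a) natural log of each feature
X = np.log(df[features])

# b) z-score: zero mean, unit variance per column
X = (X - X.mean()) / X.std(ddof=0)

# c) Tukey's fences: keep rows where every feature lies within
#    [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = X.quantile(0.25), X.quantile(0.75)
iqr = q3 - q1
inside = ((X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr)).all(axis=1)
X_clean, y_clean = X[inside], df.loc[inside, "quality"]


def r2_on_holdout(X, y, columns):
    """Fit a linear regression on `columns` and return R^2 on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[columns], y, test_size=0.25, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    return model.score(X_test, y_test)  # score() is R^2 for regressors


r2_single = r2_on_holdout(X_clean, y_clean, ["alcohol"])
r2_multi = r2_on_holdout(X_clean, y_clean, features)
print(f"R^2, alcohol only: {r2_single:.3f}")
print(f"R^2, all features: {r2_multi:.3f}")
```

Note that `LinearRegression.score` reports R², not a correlation. For a simple linear regression with one predictor, the in-sample R² equals the squared Pearson correlation, so an R² of 0.2 corresponds to a correlation of roughly 0.45 in magnitude.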

I've attached representations of the data for context.

  1. Am I using an appropriate algorithm? (based on the images attached)
  2. If linear regression is the most appropriate algorithm, how can I improve the results?

This study is part of a challenge that specifically asks entrants to predict wine quality, so I believe a stronger correlation should be possible.

correlation heat map

variable correlations against quality part 1

variable correlations against quality part 2

Above I have attached three images: a correlation heat map and two sets of scatter plots of quality against the other variables.

Jonny