We are trying to use the `f_regression` scoring function with `SelectKBest` on a pool of 1000 numerical features to solve a regression problem. We also wanted to parallelize the execution of `SelectKBest`, and we succeeded in doing so, since, as I understand it, `f_regression` is a univariate approach in which each feature is scored independently.
The major challenge is to understand exactly how `SelectKBest` computes the `f_regression` score. I understand that it performs an ANOVA and obtains an F-value, but is that F-value the same as the `f_regression` score reported by `SelectKBest`?
I had an extensive look at the mentioned article and tried to write a function implementing the same formula in order to match the result, but the computed F-score doesn't match the `f_regression` score that `SelectKBest` gives:
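
    import numpy as np

    # tag1 and target are assumed to be single-column DataFrames
    x = tag1.iloc[:, 0].to_numpy()
    y = target.iloc[:, 0].to_numpy()
    n, m = len(x), len(y)
    x_bar = (x.sum() + y.sum()) / (n + m)  # grand mean of both groups pooled
    tag1_mean = x.mean()
    target_mean = y.mean()
    # between-group sum of squares (two groups, so 1 degree of freedom)
    numerator = n * (tag1_mean - x_bar) ** 2 + m * (target_mean - x_bar) ** 2
    # within-group sums of squares
    ssw_tag = np.sum((x - tag1_mean) ** 2)
    ssw_target = np.sum((y - target_mean) ** 2)
    degree_freedom = n - 1 + m - 1
    denominator = (ssw_tag + ssw_target) / degree_freedom
    F_Val = numerator / denominator
    F_Val
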
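
For reference, the snippet above appears to follow the textbook one-way ANOVA F statistic, with the feature and the target treated as $k = 2$ groups of sizes $n_1 = n$ and $n_2 = m$, and $N = n + m$. This is a reading of the code, not a formula quoted from the article:

$$
F = \frac{\sum_{i=1}^{k} n_i \,(\bar{x}_i - \bar{x})^2 \,/\, (k - 1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \,/\, (N - k)}
$$

With $k = 2$ the numerator has $k - 1 = 1$ degree of freedom and the denominator has $N - k = n + m - 2$, which is the `degree_freedom` value used in the code.
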
The sample dataset I am using looks like this:
| F1 | F2 | Target |
|---|---|---|
| 2137.03417969 | 2247.9690 | 343.7083 |
| 2202.64135742 | 2249.1404 | 343.7735 |
| 2147.74707031 | 2243.414 | 343.9496 |
| 2131.01513672 | 2249.7673 | 344.0170 |
| 2177.02587891 | 2242.8867 | 343.9583 |
| 2202.58325195 | 2242.8474 | 343.8483 |
| 2163.75610352 | 2248.8467 | 343.6372 |
| 2138.95410156 | 2251.7893 | 343.5075 |
| 2246.29736328 | 2248.6138 | 343.4942 |
| 2235.34008789 | 2247.8184 | 343.5491 |
| 2162.52905273 | 2257.0894 | 343.6237 |
My requirement is to reproduce, for each feature, the `f_regression` score obtained from `SelectKBest`.
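
For comparison, here is a minimal sketch of what `f_regression` is documented to compute, as far as I can tell: not a two-group ANOVA of feature versus target, but the Pearson correlation `r` between each feature and the target, converted to `F = r**2 / (1 - r**2) * (n - 2)` (this assumes the default `center=True`). The helper names `pearson_r` and `f_from_correlation` are hypothetical, and the data are the sample rows from the table above. The scikit-learn comparison at the end only runs if NumPy and scikit-learn are installed.

```python
import math

# Sample rows copied from the table above.
f1 = [2137.03417969, 2202.64135742, 2147.74707031, 2131.01513672,
      2177.02587891, 2202.58325195, 2163.75610352, 2138.95410156,
      2246.29736328, 2235.34008789, 2162.52905273]
f2 = [2247.9690, 2249.1404, 2243.414, 2249.7673, 2242.8867, 2242.8474,
      2248.8467, 2251.7893, 2248.6138, 2247.8184, 2257.0894]
target = [343.7083, 343.7735, 343.9496, 344.0170, 343.9583, 343.8483,
          343.6372, 343.5075, 343.4942, 343.5491, 343.6237]


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def f_from_correlation(x, y):
    """F statistic of a one-feature linear regression of y on x (hypothetical helper)."""
    r = pearson_r(x, y)
    dof = len(y) - 2  # n - 2: one slope plus an intercept (centered data)
    return r ** 2 / (1 - r ** 2) * dof


scores = {"F1": f_from_correlation(f1, target),
          "F2": f_from_correlation(f2, target)}
print("manual :", scores)

# Optional cross-check against scikit-learn, skipped if it is not installed.
try:
    import numpy as np
    from sklearn.feature_selection import f_regression

    X = np.column_stack([f1, f2])
    sk_scores, _ = f_regression(X, np.asarray(target))
    print("sklearn:", dict(zip(["F1", "F2"], sk_scores)))
except ImportError:
    pass
```

If this reading is right, the two printed dictionaries should agree up to floating-point error, and the mismatch with the ANOVA snippet comes from comparing group means of the feature and the target rather than measuring their linear relationship.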