For (a) I wound up using logistic regression to determine the weight coefficients. In real life, tree-based approaches might do a better job of picking up how the strength of each predictor varies across the feature space, but I didn't build that kind of structure into the fake data.
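(For context, here is a minimal sketch of the kind of fake data I have in mind — the generating process below is only an illustration for this post and the specific formulas are made up; only the names m, dt, x_1 through x_5, and y match the code that follows.)

library('data.table')
set.seed(1)
m <- 1000                                            # number of fake observations
dt <- data.table(x_1 = rnorm(m), x_2 = rnorm(m),
                 x_3 = rnorm(m), x_4 = rnorm(m))
dt[, x_5 := (x_1 + x_2 + x_3 + x_4) / 4 + rnorm(m)]  # noisy composite of the other predictors
dt[, y := as.integer(plogis(x_1 + x_2 + x_3 + x_4) > runif(m))]  # binary outcome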
train_rows <- sample.int(m, size = .5 * m) # hold out half the rows for out-of-sample checks
dt[-train_rows, cor(y, x_1)] # ~0.4
fit_lm <- glm(y ~ . + 0, data=dt[train_rows]) # ordinary linear regression (gaussian glm, no intercept)
summary(fit_lm) # What does negative AIC mean here?
cor(dt[-train_rows, y], # ~0.6, better than single predictor, but...
predict(fit_lm, newdata=dt[-train_rows], type='response'))
# ... logistic regression seems more appropriate here
fit_glm <- glm(y ~ ., family = binomial, data=dt[train_rows])
summary(fit_glm)
cor(dt[-train_rows, y], # >0.9, this works well
predict(fit_glm, newdata=dt[-train_rows], type='response'))
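Since (a) is about the weight coefficients themselves, the fitted logistic coefficients would be the natural thing to report — here with an optional rescaling so the absolute values sum to one (the rescaling is just one convention, not something from the original problem):

w <- coef(fit_glm)[-1]  # drop the intercept
w / sum(abs(w))         # rescale so the absolute values sum to 1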
I'm still a little stuck on (b).
I found another question on relative importance in logistic regression that suggested caret's varImp, which I believe (based on the docs and some experimenting) uses a performance measure like AUC rather than a parametric measure such as the $|z|$ (or $|t|$) statistic from the glm summary.
library('caret'); varImp(fit_glm)
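To see how much the ranking depends on the importance measure, caret also has filterVarImp(), which the docs describe as a model-free, per-predictor score (ROC/AUC when the outcome is a two-class factor). A sketch of the comparison, assuming the predictors are literally named x_1 through x_5:

# model-free, per-predictor AUC-style importance on the training rows
filterVarImp(x = dt[train_rows, .(x_1, x_2, x_3, x_4, x_5)],
             y = factor(dt[train_rows, y]))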
On its own this wasn't terribly useful, because there's no cutoff for "not useful." Clearly x_5 is the least important, but it isn't obviously useless. (And because x_5 is a composite of several of the other predictors, it might not stand out in a correlation matrix either — see the quick check below.)
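(What I mean by the correlation-matrix point — again just a sketch over my made-up column names:)

round(cor(dt[train_rows, .(x_1, x_2, x_3, x_4, x_5, y)]), 2)  # pairwise correlations, incl. with y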
I also considered:
drop1(fit_glm, test='Chisq')
# Single term deletions
#
# Model:
# y ~ x_1 + x_2 + x_3 + x_4 + x_5
# Df Deviance AIC LRT Pr(>Chi)
# <none> 93.35 105.35
# x_1 1 281.98 291.98 188.630 <2e-16 ***
# x_2 1 276.93 286.93 183.580 <2e-16 ***
# x_3 1 368.13 378.13 274.773 <2e-16 ***
# x_4 1 324.85 334.85 231.498 <2e-16 ***
# x_5 1 93.95 103.95 0.592 0.4418
So the AIC goes down when x_5 is dropped, and p = 0.4418 suggests the model without x_5 doesn't fit significantly worse than the model with it, so I'd be inclined to keep only the first four predictors.
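As a follow-up check (a sketch, not something from the output above), I could refit without x_5, confirm that the likelihood-ratio test matches the drop1 row, and make sure the held-out correlation doesn't suffer:

fit_glm4 <- update(fit_glm, . ~ . - x_5)   # same model minus x_5
anova(fit_glm4, fit_glm, test = 'Chisq')   # should reproduce the drop1 LRT for x_5
cor(dt[-train_rows, y],                    # held-out check, as before
    predict(fit_glm4, newdata = dt[-train_rows], type = 'response'))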
I'm just posting what I tried, and hoping the experts here can point out anything I've overlooked.