I don't really understand why we minimise a cost function for gradient descent. Why don't we try to have something like a gradient 'climb', where we maximise some function?
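To make the question concrete, here is a rough sketch (in Python, with a made-up toy function, not anything from a real library) of what I mean by a gradient 'climb': the update steps uphill along the gradient instead of downhill.

```python
# Toy function with a single maximum at x = 3 (purely illustrative).
def f(x):
    return -(x - 3.0) ** 2

def grad_f(x):
    return -2.0 * (x - 3.0)

x = 0.0    # starting point
lr = 0.1   # step size / learning rate

for _ in range(100):
    # Gradient 'ascent': step *along* the gradient,
    # whereas gradient descent would do "x -= lr * grad_f(x)".
    x += lr * grad_f(x)

print(x)  # moves towards 3.0, the maximiser of f
```

This seems to work just as well as descending on a cost, so I am not sure what the difference is.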
Is it just a convention, or are there properties that make minimising a function better suited to optimisation than maximising one?
A similar question was asked here, but I don't feel the answers address my question directly, or in a way I understand.