In my reading on maximum likelihood estimation, the examples all work with samples from KNOWN distributions (e.g. binomial, Poisson, etc.). I wonder how I can connect this to my knowledge of machine learning.
In machine learning, it's always the log-likelihood that gets maximized, usually written as P(data|theta). However, we often don't know the underlying distribution of the data (the data might be text or arbitrary numeric values), so can I say that optimization algorithms are used to estimate the best parameters when the underlying population distribution is unknown?
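To make my question concrete, here is a minimal sketch of the case I *do* understand: when the distribution is assumed known (Gaussian here, with synthetic data purely for illustration), the log-likelihood gives a concrete objective that a generic optimizer can maximize.

```python
# Sketch: MLE as numerical optimization, ASSUMING a Gaussian model.
# The Gaussian assumption is what supplies the objective function here.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)  # synthetic sample

def neg_log_likelihood(params):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays > 0
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # should land close to the true (2.0, 1.5)
```

My confusion is about what replaces the `norm.logpdf` line when no such distributional assumption is available.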
But as far as I know, optimization only works when the objective function satisfies certain conditions that make it optimizable (e.g. it has a global or local minimum/maximum). So how do we handle MLE when there is no obvious objective function and the underlying probability distribution is unknown? Or is this not even a problem, because such a setting already fails the conditions for gradient descent? A sketch of my current understanding of the supervised case follows.
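Here is how I currently picture it, and I'd like to know if this is right: choosing a model family (logistic regression below, i.e. an assumed Bernoulli likelihood with a sigmoid link) is what supplies the objective, and because that negative log-likelihood is differentiable (and in this case convex), gradient descent applies. The synthetic data, step size, and iteration count are arbitrary choices for the sketch.

```python
# Sketch: once a model family is chosen, the negative log-likelihood IS the
# objective. Logistic regression = MLE under an assumed Bernoulli likelihood.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([1.0, -2.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

w = np.zeros(2)
lr = 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))   # model's predicted P(y=1 | x, w)
    grad = X.T @ (p - y) / len(y)  # gradient of the mean negative log-likelihood
    w -= lr * grad                 # one gradient-descent step

print(w)  # approaches true_w as the sample size grows
```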
TIA.