Very briefly, we can turn any deterministic model into a statistical model by making some assumptions about the distribution of errors (which requires thinking carefully about the causes of those errors), and those assumptions in turn guide us towards certain metrics and even statistical tests. However, this is an involved, highly iterative process. Some examples are in order.
Example 1: Statistics Motivated by the Underlying Physics
For particle physics experiments, such as the search for the Higgs Boson, experimental data is collected by particle detectors placed at different angles around the collision point. Particle detector data is a count of events where the probability of an event is independent of the time since the last event. Therefore we know that the count data follow a Poisson distribution. Our theory makes a prediction for what kinds of events we will detect at which angles, which gives us the mean number of events we would expect at each detector. We can tell whether the experiment agrees with theory by asking whether the observed count falls within the confidence interval of a Poisson distribution whose mean is our theoretical prediction.
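Here is a minimal sketch of that check, assuming SciPy is available; the theoretical mean and the observed count are made-up placeholders, not real detector data:

```python
# Sketch: is an observed event count consistent with a theoretical Poisson mean?
# The numbers below are placeholders for illustration only.
from scipy.stats import poisson

theoretical_mean = 52.3   # mean event count predicted by theory at this detector
observed_count = 61       # events actually recorded

# Central 95% interval of a Poisson distribution with the theoretical mean.
low, high = poisson.interval(0.95, theoretical_mean)

if low <= observed_count <= high:
    print(f"Observed {observed_count} is consistent with theory ({low:.0f}-{high:.0f}).")
else:
    print(f"Observed {observed_count} falls outside the 95% interval ({low:.0f}-{high:.0f}).")
```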
Example 2: Using Statistics to Model Measurement Error
If we measure some quantity with an instrument, say using a scale to measure the weight of a precipitate after some chemical reaction, the source of error is "measurement error". Measurement error is often normally distributed, perhaps because it is the accumulation of many other small discrepancies throughout the experiment, such as small variations in the amounts of the reagents or the exact temperature of the reaction. Measurement errors are also very likely to be independent from measurement to measurement (although there are also techniques for dealing with repeated measures). Under these assumptions, the differences between our theoretical predictions and our observations will be normally distributed, and we can use statistics like Mean Square Error (MSE) to quantify the agreement between theory and experiment.
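A minimal sketch of this case, with placeholder data, assuming NumPy and SciPy are available; the Shapiro-Wilk test is just one quick way to spot-check the normality assumption:

```python
# Sketch: quantify theory/experiment agreement under the assumption of
# independent, normally distributed measurement error. Data are placeholders.
import numpy as np
from scipy import stats

predicted = np.array([1.02, 2.05, 2.98, 4.01, 5.03])   # theoretical predictions
observed  = np.array([1.00, 2.10, 2.95, 4.07, 4.99])   # measured values

residuals = observed - predicted
mse = np.mean(residuals ** 2)

# Quick (low-power on small samples) check of the normality assumption.
stat, p_normal = stats.shapiro(residuals)

print(f"MSE = {mse:.4f}, Shapiro-Wilk p = {p_normal:.3f}")
```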
Example 3: Using Statistical Modeling to Create Empirical Models
We often have cases where the errors are not so well behaved. Consider the ballistic trajectory of a bullet modeled by a simple parabolic trajectory. There is measurement error, certainly, but it is dwarfed by other sources: wind speed, the Coriolis effect, air friction. If you plot observed versus theoretical predictions, you will see two curves that diverge more and more over time. The residuals (the differences between observed and predicted values) will not be independent and normally distributed with mean zero; they will be biased (mean of the error not equal to zero), correlated (not independent), heteroscedastic (variance changes over time), and often skewed (not normally distributed). There are three options: 1) introduce a statistical model that captures these complications, 2) add atheoretical empirical corrections to the model and leave them to be explained later, or 3) eschew formal statistical testing for the time being and report a rough-and-ready metric.
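To make the diagnosis concrete, here is a rough sketch of how one might spot the bias, autocorrelation, and growing variance described above; the residuals are simulated placeholders mimicking unmodeled drag and wind, not real data:

```python
# Sketch: quick residual diagnostics for bias, correlation, and heteroscedasticity.
import numpy as np

t = np.linspace(0, 2.0, 50)                       # time points (s)
rng = np.random.default_rng(0)
# Placeholder residuals that diverge over time, mimicking unmodeled drag/wind.
residuals = -0.5 * t**2 + rng.normal(scale=0.05 * (1 + t), size=t.size)

bias = residuals.mean()                                    # far from zero -> biased
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]    # near 1 -> correlated
var_early = residuals[: t.size // 2].var()
var_late = residuals[t.size // 2 :].var()                  # growing -> heteroscedastic

print(f"mean residual = {bias:.3f}")
print(f"lag-1 autocorrelation = {lag1:.3f}")
print(f"variance early/late = {var_early:.4f} / {var_late:.4f}")
```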
Guidelines
So, where does that leave you? It depends on where you are in the process. Here is a rough roadmap:
- Establish qualitative agreement (what the Wikipedia article calls "face validity"). This can be as simple as noting that the model is directionally correct: when one quantity goes up, so does the other.
- Check that your model is unbiased by verifying that the mean of the differences between actual and predicted values, $\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)$, is small. Very little can be said about badly biased models, and in particular rough-and-ready metrics like Mean Square Error (MSE) or Mean Absolute Deviation (MAD) make little sense for them.
- Assume a very simple model, such as additive, normally distributed errors, and verify the assumptions of normality, homoscedasticity, etc. If this works, great! We're in Example 2 territory, with mostly well-behaved measurement error to worry about, and you can use the model to report goodness-of-fit, confidence intervals, and p-values.
- If the simple model fails, either give up on statistical inference and report MSE, MAD, or similar metrics in a rough-and-ready way, or
- start adding empirical terms to the model and performing model selection on them. For the ballistic example, we might add a cubic correction and verify that it reduces AIC by a meaningful amount (a sketch of such a comparison follows this list). This generally requires considerably more data, but if it works you have an empirical fit to the data for theory to explain later. For example, Planck's law was known empirically, from fitting data, before any theoretical explanation could be offered.
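As a sketch of that last option, here is one way to add an empirical cubic term to a quadratic (parabolic) model and compare AIC, using statsmodels; the "observed" trajectory is simulated placeholder data, not real ballistics:

```python
# Sketch: model selection via AIC after adding an atheoretical cubic correction.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
t = np.linspace(0, 2.0, 200)
# Placeholder "observed" drop: parabolic term plus an unmodeled cubic effect.
y = 4.9 * t**2 + 0.3 * t**3 + rng.normal(scale=0.05, size=t.size)

# Base model: quadratic in t (still linear in its parameters).
X_base = sm.add_constant(np.column_stack([t, t**2]))
# Extended model: add an empirical cubic correction term.
X_cubic = sm.add_constant(np.column_stack([t, t**2, t**3]))

fit_base = sm.OLS(y, X_base).fit()
fit_cubic = sm.OLS(y, X_cubic).fit()

print(f"AIC without cubic term: {fit_base.aic:.1f}")
print(f"AIC with cubic term:    {fit_cubic.aic:.1f}")
# A substantially lower AIC for the cubic model supports keeping the
# empirical correction, even before any theoretical justification exists.
```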
By far the most commonly used statistical model for Example 2 / Step 3 is the ordinary linear model. It is a common misconception that "linear" means "a linear relationship between the independent and dependent variables" when it really means "linear in the parameters". For example, we could model a non-linear relationship between volume and length as $V = \beta L^3 + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and this is still a linear model. It is also not necessary to fit a linear model to the data: many metrics like MSE or the log-likelihood can still be computed even if the model was not fit. (Other metrics, such as the z-scores/p-values for individual parameters reported by most statistical software, do rely on the model being fit, and those you can't use.)
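A short sketch of that point, with placeholder data: the model $V = \beta L^3 + \epsilon$ is linear in $\beta$, and MSE and a Gaussian log-likelihood can be computed with $\beta$ fixed by theory rather than fitted (here the error scale is estimated from the residuals purely for illustration):

```python
# Sketch: score a theory-specified (unfitted) model with MSE and log-likelihood.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
L = np.linspace(1.0, 5.0, 40)                        # lengths (placeholder data)
V = 1.0 * L**3 + rng.normal(scale=2.0, size=L.size)  # "observed" volumes

# Theory fixes beta = 1 (a cube), so no fitting is needed to score the model.
beta_theory = 1.0
predicted = beta_theory * L**3
residuals = V - predicted

mse = np.mean(residuals**2)
# Error scale taken from the residuals here; it could also come from instrument specs.
sigma_hat = residuals.std(ddof=0)
log_lik = norm.logpdf(residuals, loc=0.0, scale=sigma_hat).sum()

print(f"MSE = {mse:.3f}, log-likelihood = {log_lik:.1f}")
```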
I know the above answer is kind of high-level, but it's difficult to do justice to the topic in this format. Oberkampf has written a couple of papers on this: a more introductory one that introduces the concepts and a more quantitative one that discusses the statistics of validation metrics. The second paper contains all the detail needed to flesh out step (3) above, for example. Wikipedia also has a very brief introduction to the question.