The way p-values are typically used may not make sense to you because the way p-values are typically used does not make sense. At least, that is what some statisticians would say, and I am sympathetic to this critique.
First, let's assume a case that is purely descriptive. Simplifying your example a bit, say we have math test scores for a certain set of students and we want to know if female scores are higher than male scores for these students. Well, we can just look at the averages, and there's your answer. No need for a p-value. In fact, in this context, a p-value doesn't really have any meaning.
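To make the descriptive case concrete, here is a minimal sketch with made-up scores: when the students in hand are the entire group of interest, comparing the averages answers the question directly, with no inference involved.

```python
# Purely descriptive case (hypothetical scores): these students ARE the
# group we care about, so the group averages settle the question outright.
female_scores = [81, 75, 90, 68, 77]
male_scores = [72, 79, 85, 64, 70]

female_avg = sum(female_scores) / len(female_scores)  # 78.2
male_avg = sum(male_scores) / len(male_scores)        # 74.0

# No p-value needed: the comparison is a fact about this set of students.
print(female_avg > male_avg)  # True: female average is higher here
```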
Now assume that this set of students is a random sample from a larger population, and we want to know whether the female math scores are higher than the male math scores for this entire population, not just the subset we sampled. Then a p-value makes sense because it quantifies the sampling uncertainty around our estimate. Just because the female scores are higher in our sample doesn't mean they're higher in the population.
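A quick sketch of this sampling situation, using simulated scores (the data here are hypothetical; in practice they would be your sampled students): the p-value asks how often a sample difference at least this large would arise if the population means were actually equal.

```python
# Random-sample case: the p-value quantifies sampling uncertainty.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated draws standing in for a random sample from the population.
female = rng.normal(loc=72, scale=10, size=40)
male = rng.normal(loc=70, scale=10, size=40)

diff = female.mean() - male.mean()
# Welch's two-sample t-test: compatibility of the observed difference
# with equal population means.
t_stat, p_value = stats.ttest_ind(female, male, equal_var=False)
print(f"sample difference: {diff:.2f}, p-value: {p_value:.3f}")
```

The point is that the p-value is a statement about the population, justified by the random sampling; with the descriptive question above, there is nothing for it to be about.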
Instead, imagine that our set of students isn't a random sample but a convenience sample, and I think this is the sort of case you're seeing. Then the above logic doesn't follow. Instead, when people report a p-value in this sort of situation, they are implicitly appealing to the concept of a super-population. Basically, they're asking you to imagine that the set of students is drawn from an imaginary (possibly infinite) population of students similar to the actual students in your sample. For some people, this thought experiment is quite convincing, and they could probably describe it better than I can. In any case, the assumption here is that you really want to know whether female scores in this super-population are higher than male scores. The p-value attempts to quantify your uncertainty in inferring from the set of students you observe to this super-population.
To quote the late David Freedman,
Samples of convenience are often analyzed as if they were simple random samples from some large, poorly-defined parent population. This unsupported assumption is sometimes called the “super-population model.” The frequency with which the assumption has been made in the past does not provide any justification for making it again, and neither does the grandiloquent name.
He goes on to say,
An SE for a convenience sample is best viewed as a de minimis error estimate: if this were—contrary to fact—a simple random sample, the uncertainty due to randomness would be something like the SE.
See a link here. For a lengthier discussion from Berk and Freedman, see here. A representative quote from this article goes like this:
As we shall explain below, researchers may find themselves assuming that their sample is a random sample from an imaginary population. Such a population has no empirical existence, but is defined in an essentially circular way—as that population from which the sample may be assumed to be randomly drawn. At the risk of the obvious, inferences to imaginary populations are also imaginary.
Again, while Freedman (and I) don't necessarily find the notion of a super-population convincing, many do, and it's widely used.
Finally, p-values can also make sense in the context of causal inference. For the simple case of a randomized experiment, you typically want to know the average treatment effect on the participants. However, to know this, you'd need to see each participant both treated and untreated. Instead, you only see, say, half treated and the other half untreated. You take this difference as your estimate of the average treatment effect, but you still have uncertainty here, because if a slightly different set of participants had ended up in the treatment group (instead of the control group), your estimate would be a little different. The uncertainty here comes from random assignment rather than random sampling, but it works out much the same way, and the p-value helps quantify it.
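The random-assignment logic above can be sketched directly as a randomization (permutation) test, here with simulated outcomes and a built-in treatment effect: we re-randomize the treatment labels many times and ask how often a mean difference as large as the observed one would occur under chance assignment alone.

```python
# Randomization inference for a simple randomized experiment
# (hypothetical data): uncertainty comes from who got assigned where.
import numpy as np

rng = np.random.default_rng(1)
outcomes = rng.normal(50, 8, size=60)
outcomes[:30] += 3                       # simulated treatment effect
treated = np.array([True] * 30 + [False] * 30)

observed = outcomes[treated].mean() - outcomes[~treated].mean()

# Re-randomize the assignment many times to build the reference
# distribution implied by "a slightly different set of participants
# ended up in the treatment group."
n_perm = 5000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(treated)
    perm_diffs[i] = outcomes[shuffled].mean() - outcomes[~shuffled].mean()

# Two-sided p-value: share of re-assignments at least as extreme.
p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
print(f"estimated ATE: {observed:.2f}, randomization p-value: {p_value:.3f}")
```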
Causal inference for quasi-experimental (observational) studies can also be conducted by attempting to estimate what would have happened if you had run a randomized experiment. Here, too, the p-value can quantify uncertainty about this inference.