To address the issue of bias: when we look at the effect size only in tests that reject, we are no longer working with the full sampling distribution of the estimator $\hat\theta$ of $\theta$ but with the conditional distribution $\hat\theta\vert\text{reject }H_0$, and there is no reason to expect this conditional distribution to retain the unbiasedness that $\hat\theta$ has.
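A heuristic way to see this for the one-sample $t$-test used below: at level $\alpha$, the test rejects only when the sample mean clears the threshold
$$c_n = t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt n},$$
so conditioning on rejection truncates the sampling distribution of $\bar X$ to the region $|\bar X| \ge c_n$. When $c_n$ sits well above the true $\theta > 0$, only unusually large sample means survive the conditioning, and $\mathbb E[\hat\theta\vert\text{reject }H_0]$ is pulled well above $\theta$; as $n$ (and hence power) grows, $c_n$ shrinks below $\theta$ and the truncation hardly matters.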
Regarding the issue of being "underpowered": it is true that a formal definition of this term would be nice. Note, however, that as power increases, the estimation bias in estimates corresponding to rejected null hypotheses decreases. The simulation below makes this concrete for a one-sample $t$-test with a true effect size of $0.1$.
library(pwr)
library(ggplot2)

set.seed(2022)
Ns <- seq(50, 2000, 50)   # sample sizes to sweep over
B <- 10000                # simulation replications per sample size
powers <- biases <- ratio_biases <- rep(NA_real_, length(Ns))
effect_size <- 0.1        # true effect size (theta)

for (i in seq_along(Ns)){

  # analytic power of the one-sample t-test at this sample size
  powers[i] <- pwr::pwr.t.test(
    n = Ns[i],
    d = effect_size,
    type = "one.sample"
  )$power

  # observed effect sizes, kept only when the test rejects H0
  observed_sizes_conditional <- rep(NA_real_, B)
  for (j in 1:B){
    x <- rnorm(Ns[i], effect_size, 1)
    pval <- t.test(x)$p.value
    if (pval <= 0.05){
      observed_sizes_conditional[j] <- mean(x)
    }
  }
  observed_sizes_conditional <- observed_sizes_conditional[
    !is.na(observed_sizes_conditional)
  ]

  # bias of the conditional estimator, on two scales
  ratio_biases[i] <- mean(observed_sizes_conditional) / effect_size
  biases[i] <- mean(observed_sizes_conditional) - effect_size

  print(paste(i, "of", length(Ns)))
}
d1 <- data.frame(
  Power = powers,
  Bias = biases,
  Statistic = "Standard Bias"
)
d2 <- data.frame(
  Power = powers,
  Bias = ratio_biases,
  Statistic = "Ratio Bias"
)
d <- rbind(d1, d2)

ggplot(d, aes(x = Power, y = Bias, col = Statistic)) +
  geom_line() +
  geom_point() +
  facet_grid(rows = vars(Statistic), scales = "free_y") +
  theme_bw() +
  theme(legend.position = "none")
I do not know the standard term for what I call "ratio bias", but I mean $\dfrac{\mathbb E[\hat\theta\vert\text{reject }H_0]}{\theta}$. Since the true effect size is not zero, this ratio is well-defined.
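The two statistics carry the same information on different scales, since
$$\dfrac{\mathbb E[\hat\theta\vert\text{reject }H_0]}{\theta} \;=\; 1 + \dfrac{\mathbb E[\hat\theta\vert\text{reject }H_0] - \theta}{\theta},$$
so the ratio bias is just the standard conditional bias expressed relative to the size of the true effect.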

This makes sense for the $t$-test: with a smaller sample size (and hence less power), the standard error is larger, so a larger observed effect is needed to reach significance, and only those inflated estimates survive the conditioning.
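As a rough back-of-the-envelope check (taking the sample standard deviation to be close to its true value of $1$), the smallest sample mean that can reach significance at the two ends of the simulated range is:

# approximate rejection thresholds for the sample mean, assuming s is close to 1
qt(0.975, df = 50 - 1) / sqrt(50)      # ~0.28, nearly three times the true effect of 0.1
qt(0.975, df = 2000 - 1) / sqrt(2000)  # ~0.04, well below the true effect of 0.1

So at $n = 50$ only samples whose means overshoot the true effect by a factor of almost three can reject, while at $n = 2000$ essentially no overshoot is needed, which is why the conditional estimate is so inflated at low power.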
By framing it this way, we sidestep the irritating issue of defining what an "underpowered" study means and simply show that more power means less estimation bias. This also explains what is happening in the linked question, where a reviewer asked the author for the power of the test in order to screen for gross bias in the conditional estimator $\hat\theta\vert\text{reject }H_0$: if the power is low, the graphs above suggest the bias will be large, while high power makes the bias nearly vanish, hence the reviewer's insistence on high power.