The approach I would take would be to have two (or possibly three) outputs, one of which predicts the probability of there being a transaction and the other of which predicts the conditional mean (and perhaps conditional variance) of the value of the transaction should there be one. It is relatively straightforward to construct a likelihood for both tasks at the same time, and this is an approach I have used successfully for modelling rainfall (based on the work of Peter Williams).
Peter M. Williams, "Modelling seasonality and trends in daily rainfall data", NIPS'97: Proceedings of the 10th International Conference on Neural Information Processing Systems, December 1997, pp. 985–991.
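To make the joint likelihood idea concrete, here is a minimal sketch (my own illustration, not the exact model from the answer above or from Williams' paper), assuming a PyTorch network with three heads: the logit of the probability of a transaction, and the mean and log-variance of log(amount) given that a transaction occurs. The negative log-likelihood is just the Bernoulli term for occurrence plus a Gaussian term for the log-amount, the latter evaluated only on examples where a transaction actually happened. The names `HurdleNet` and `joint_nll` are made up for this sketch.

```python
import torch
import torch.nn as nn

class HurdleNet(nn.Module):
    """Toy network: one body, three output heads (sketch only)."""
    def __init__(self, n_features, n_hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Tanh())
        self.logit = nn.Linear(n_hidden, 1)     # P(transaction) on the logit scale
        self.mu = nn.Linear(n_hidden, 1)        # conditional mean of log(amount)
        self.log_var = nn.Linear(n_hidden, 1)   # conditional log-variance of log(amount)

    def forward(self, x):
        h = self.body(x)
        return self.logit(h), self.mu(h), self.log_var(h)

def joint_nll(logit, mu, log_var, amount):
    """Negative log-likelihood of the hurdle model (up to an additive constant).

    amount: tensor of transaction values, 0 where no transaction occurred.
    """
    occurred = (amount > 0).float()
    # Bernoulli likelihood for whether a transaction happened at all.
    occ_nll = nn.functional.binary_cross_entropy_with_logits(
        logit.squeeze(-1), occurred, reduction="sum"
    )
    # Gaussian likelihood on log(amount), counted only for observed transactions.
    log_amt = torch.log(amount.clamp(min=1e-12))
    var = log_var.squeeze(-1).exp()
    amt_nll = 0.5 * (log_var.squeeze(-1) + (log_amt - mu.squeeze(-1)) ** 2 / var)
    return occ_nll + (occurred * amt_nll).sum()
```

Under that (assumed) log-normal parameterisation, the expected transaction value for a new case is simply `sigmoid(logit) * exp(mu + var / 2)`, so the two heads combine naturally at prediction time.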
That approach has the advantage of not needing any resampling. Resampling is often of dubious value anyway: it means the training set is no longer statistically representative of operational conditions, so the model is likely to over-predict transaction value unless some other correction is made.
In this sort of situation it is vital to understand what performance criterion is important for operational use and why, as otherwise it is easy to be led astray pursuing performance statistics that don't measure what is really important, and to end up with a system that looks good on paper but doesn't actually do anything useful.

This is very common with imbalanced classification problems, where optimal accuracy is often achieved by ignoring the minority class. If that is not acceptable, it means that false negatives are a "worse" error than false positives, which in turn means accuracy is not the right performance metric and you should be using something like expected loss (effectively a cost-weighted accuracy). The imbalance isn't the problem; the problem is using a performance metric that doesn't measure what you really want to measure. A sketch of the expected-loss idea follows below.
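As a small illustration of expected loss (with made-up costs purely for the example: a false negative costing 10 units and a false positive costing 1), the Bayes-optimal decision is to flag a case whenever the predicted probability exceeds c_FP / (c_FP + c_FN), rather than the usual 0.5. The function name `expected_loss` and the cost values are hypothetical.

```python
import numpy as np

C_FN, C_FP = 10.0, 1.0   # hypothetical misclassification costs

def expected_loss(y_true, p_pred, c_fn=C_FN, c_fp=C_FP):
    """Average cost when thresholding probabilities at the cost-optimal point."""
    threshold = c_fp / (c_fp + c_fn)          # flag if P(positive) exceeds this
    y_hat = (p_pred >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_hat == 0))
    fp = np.sum((y_true == 0) & (y_hat == 1))
    return (c_fn * fn + c_fp * fp) / len(y_true)

# A classifier that ignores the minority class looks fine on accuracy (95%)
# but poor on expected loss: five false negatives at cost 10 each.
y_true = np.array([0] * 95 + [1] * 5)
p_lazy = np.zeros(100)                  # always predicts "no event"
print(expected_loss(y_true, p_lazy))    # 0.5
```

The point of the example is only that the cost-weighted criterion exposes the "always predict the majority class" solution that plain accuracy rewards.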