2

While making an application I've come to the point where I want to add logging for the inevitable case when something goes wrong. Now this seems like a problem that should have been solved decades ago, and indeed - there's no shortage of logging libraries for any decent programming language. But while trying to choose among them I ran across a rather fundamental dilemma.

Do I emphasize safety or performance?

Let's look at both extremes. If I emphasize safety, then I should make sure that the log entry is safely stored before I proceed with work. Otherwise I could end up in a situation where my program thinks it has logged 5 entries and has done substantial work, but then an error makes those entries disappear and later forensics turn up nonsensical results. "I see that the program only logged up to point A, but there is already data stored which suggests it reached point B. Then where are the log entries between A and B?! What's going on here?!" At the furthest extreme of safety, this means that after I produce a log entry I also need to wait until it has been successfully flushed to physical storage. However, that's a VERY expensive operation and would basically kill my program's performance, since every log call would probably take tens if not hundreds of milliseconds.
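
For concreteness, here's a minimal sketch of that safety-first extreme in Node.js (the platform that comes up later in the comments). The file name and entry format are illustrative assumptions; the point is that every call blocks until an fsync completes.

```js
const fs = require('fs');

// Keep one descriptor open for the lifetime of the process.
const fd = fs.openSync('app.log', 'a');

function logSafe(message) {
  const entry = `${new Date().toISOString()} ${message}\n`;
  fs.writeSync(fd, entry); // hand the bytes to the OS...
  fs.fsyncSync(fd);        // ...and block until they reach physical storage
}

logSafe('reached point A'); // every call pays the full cost of a disk flush
```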

On the other extreme: performance. In this case I need to shove the log entry to wherever as fast as I can and move on without looking back. A background process can then take said entry and try to write it to physical storage, but the main process is unaffected. The main risk here is the one mentioned before: it's easy to lose log entries if everything crashes and burns before they can be written out.
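
And a minimal sketch of the performance-first extreme, under the same illustrative assumptions: log calls only append to an in-memory buffer, and a timer drains it in the background. Anything still sitting in the buffer is lost if the process dies.

```js
const fs = require('fs');

const buffer = [];
const stream = fs.createWriteStream('app.log', { flags: 'a' });

function logFast(message) {
  buffer.push(`${new Date().toISOString()} ${message}\n`); // returns immediately
}

// Drain the buffer once a second; the main code path never waits on I/O.
setInterval(() => {
  if (buffer.length > 0) {
    stream.write(buffer.join(''));
    buffer.length = 0;
  }
}, 1000).unref(); // unref() so this timer alone doesn't keep the process alive
```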

And this whole thing is on a spectrum, of course. By employing various strategies you can make your chosen approach either safer (but slower) or faster (but riskier). And... I can't decide how to choose.

Is there some commonly accepted middle ground? Has anyone ever come up with a way of deciding where your particular situation should lie?

Vilx-
  • 5,320
  • Why would an error cause your log entries to "disappear"? Unless the error is something catastrophic, like out of memory or disk full, your program should be able to continue long enough to persist the logs. – John Wu Mar 24 '20 at 15:47
  • @JohnWu As long as the data gets buffered somewhere before being written out, it's possible. If it's stuck in a buffer in your local process, then a simple process crash will do it. If it's stuck in a buffer at the OS level, then a power failure or a hardware failure (BSOD) could do it. Etc. – Vilx- Mar 24 '20 at 16:26
  • Do you actually have a problem with processes crashing all over the place (to the point where you refer to them as "simple")? Or is this just hypothetical? Unless you are writing at a pretty low level there should always be some opportunity for your program to handle the error or exception before bailing out. – John Wu Mar 24 '20 at 16:31
  • @JohnWu Ermm, hypothetical. I just want to do it right. But I do dig through log files of our systems on a regular basis, and the information in there is invaluable. Process crashes aren't common, but they do happen now and then. We use pretty high level languages though (PHP, JavaScript), so true segfaults and the like are rare. Well, maybe I am overthinking this a bit... – Vilx- Mar 24 '20 at 16:38
  • How frequent are your errors that the cost of outputting a log becomes a concern? I tend to encounter some performance concerns in my experience with logging, but not so much in exceptional paths of execution... just logging in the normal execution paths, like logging when a small operation begins and ends. –  Oct 21 '20 at 09:58
  • @DemonCode - what? No, it's the logging in the normal execution paths that has me worried about performance. The exceptional paths (the ones that are really exceptional and unhandled) have me worried about safety. – Vilx- Oct 21 '20 at 10:32
  • @Vilx- I see. My bad, I misunderstood and thought you were only using the logging for error cases, so I was a bit confused about the performance concerns. In our case (computer graphics software, so not exactly mission-critical), we favor the fastest method that excludes expensive things like file I/O from the calling thread. The main value it has for us is mostly in obscure crash cases we can't reproduce ourselves with the debugger, and it's actually the logging in the normal execution paths, rather than error message logs, that tends to be most useful in those rare circumstances... –  Oct 21 '20 at 10:45
  • @Vilx- Although I could easily see asynchronous methods failing to log messages of note in the crash-and-burn cases, I've not really found it an issue that crops up for us. If the user crashes and sends us the log, we've always been able to get a good idea of roughly where the user's machine crashed by glancing at the bottom. While I could see it happening, we've never had a case where that much, if any, relevant logging information failed to be written before the process terminated. –  Oct 21 '20 at 10:48
  • @Vilx- If the safety aspect is a major concern regardless, the easiest way perhaps to get both is to actually run the logger in a separate process and use a fast form of IPC. Then if your application process goes down, the logger process can still continue to execute and finish and flush its output. –  Oct 21 '20 at 10:49
  • @DemonCode - In my case it's a backend server which also (among other things) deals with some financial data. So, yes, I'm interested in as much logged data as reasonably possible. And not just for the cases where something crashed and burned, but also for the cases where everything seemed to work fine, but the later numbers didn't add up. A separate process with IPC is also something I've thought about, although it adds complexity and more potential failure points. – Vilx- Oct 21 '20 at 12:38
  • @Vilx- I see -- I could imagine an enormous amount of logging in those cases compared to ours. IPC actually makes the most sense to me even though it does bring its own cons. It's also something I thought about a bit but we ended up deciding our needs weren't quite critical enough for it, and maybe some aesthetic distaste for launching separate background processes on the user's machine. One of the difficulties I always found in our case, even though it's much less mission-critical, is the verbosity of logs. We could configure it but if we're trying to get as much detailed information as [...] –  Oct 21 '20 at 20:29
  • @Vilx- ... as possible in a once-in-a-blue-moon crash and burn case which we can't reproduce later, or on the same hardware as ours, we might want to err towards always verbose even before any problems are detected to ensure we get enough detailed info in the log right before the crash... except the verbose logging cuts into the performance if it's synchronous and flushes. –  Oct 21 '20 at 20:31
  • @DemonCode - Yup. That's the tradeoff. Performance vs safety. And the more verbose it gets, the more pronounced the difference. To be honest, so far we haven't done anything yet. :D The solution we're using now is (since the project is in NodeJS) - just dump the logs to stdout and let pm2 handle dumping them to a file. Which is... actually, kinda like the IPC solution. I just don't know if stdout gets buffered in my process or not, and could I lose information in a particularly nasty crash. But so far everything's working fine. Maybe I'm overthinking it. :) – Vilx- Oct 21 '20 at 23:31

2 Answers

5

For almost any kind of system, regardless of whether it is a small or a huge one, it is a good idea to follow this tactic:

First make it right, then make it fast!

(and: don't optimize "just in case").

Logging (especially when done for the purpose of post-mortem analysis) usually needs to be complete and must not lose the final entries, so I would strongly recommend always starting with a solution which flushes each logged line to disk immediately. Otherwise the log could become pretty useless.

Now, when you have added "enough" logging to your application, and when you actually experience a real, measurable performance hit, then (and only then) start thinking about optimizations. For most real-world applications, it is probably enough to have a switch for turning logging on or off. Or, if that is not sufficient, a parameter to change the logging level (like "minimal", "standard", "verbose"). So one can make the decision of "safety" vs. "performance" at run time.
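
As a rough illustration of such a run-time knob (the level names and the environment variable are assumptions, not any particular library's API), the level check can happen before any I/O is attempted, so filtered-out messages cost almost nothing:

```js
const LEVELS = { minimal: 0, standard: 1, verbose: 2 };
const active = LEVELS[process.env.LOG_LEVEL] ?? LEVELS.standard;

function log(level, message) {
  if (LEVELS[level] > active) return;              // filtered out up front
  process.stdout.write(`[${level}] ${message}\n`); // the sink is interchangeable
}

log('standard', 'request handled');   // emitted under the default setting
log('verbose', 'cache miss for key'); // dropped unless LOG_LEVEL=verbose
```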

There are definitely some systems for which this is not sufficient (for example, when you observe hard-to-reproduce bugs only after some hours under production load but cannot afford full "debug" logging in a production environment, or when you don't have a chance to easily restart the system in "logging" mode after a failure, like software for an airplane's black box).

If you are really developing such extraordinary software, you might have to think of further optimizations, like a background process which receives logging events and can write them, buffered, to disk even when the main program crashes, or whatever makes the most sense for the particular kind of system you are developing. However, as always with optimizations, this has to be done with in-depth knowledge of the requirements, the use cases and the system; there is no easy "one-size-fits-all" solution.
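
One possible shape of that background-process idea, sketched with Node's built-in child-process IPC channel (the file names and message format are assumptions): the application's log calls become cheap non-blocking sends, while the logger process pays the flush cost and can still write out what it has already received if the application dies.

```js
// main.js -- the application process
const { fork } = require('child_process');

const logger = fork('./logger.js'); // spawns a child with an IPC channel

function log(message) {
  logger.send({ ts: Date.now(), message }); // non-blocking hand-off
}

log('application started');
```

```js
// logger.js -- the dedicated logger process
const fs = require('fs');

const fd = fs.openSync('app.log', 'a');

process.on('message', ({ ts, message }) => {
  fs.writeSync(fd, `${new Date(ts).toISOString()} ${message}\n`);
  fs.fsyncSync(fd); // durable, but the cost is paid here, not in the application
});
```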

Doc Brown
  • 206,877
  • Why didn't I think of this before? A configuration switch that determines the level of safety needed. By default the fastest options will be quite sufficient, but if you repeatedly run into mysterious problems where it seems that log entries are missing, you can turn on "slow and steady". – Vilx- Mar 24 '20 at 16:31
  • 2
    This is exactly why loggers have logging levels. Don't run with debug or trace level turned on for no reason. – candied_orange Mar 24 '20 at 16:57
  • Logging frameworks generally have special handling for a log message with high severity (i.e. one indicating a failure): they will try to flush so that these aren't lost. If the framework doesn't do this, one approach is to have separate sinks (channels) for low-severity and high-severity log messages (see the sketch after these comments). – rwong Mar 24 '20 at 22:55
  • Logging levels have two meanings. The first is a severity descriptor for individual log messages. The second is a "discard" directive, causing messages of low severity to be disregarded. – rwong Mar 24 '20 at 22:57
  • @rwong: I don't know if "all logging frameworks over the world" provide a "channel for messages which could be disregarded" - maybe this is a feature of one specific framework you have in mind? However, frequent debugging messages, which can slow down a system, need to be persisted immediately if they are used for the goal of finding the root cause of a system's crash. So I have my doubts that using an "insecure logging channel" would be very useful for this purpose. Maybe such a channel could be of use for a different purpose, like profiling. – Doc Brown Mar 25 '20 at 03:56
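
A minimal sketch of the severity-split approach described in the comments above (the sink file names, level names, and threshold are illustrative assumptions): routine messages take a cheap buffered path, while high-severity ones are flushed immediately, so a crash right after an error still leaves the error on disk.

```js
const fs = require('fs');

const fastSink = fs.createWriteStream('app.log', { flags: 'a' }); // buffered
const safeFd = fs.openSync('errors.log', 'a');                    // flushed

function log(level, message) {
  const entry = `${new Date().toISOString()} [${level}] ${message}\n`;
  if (level === 'error' || level === 'fatal') {
    fs.writeSync(safeFd, entry);
    fs.fsyncSync(safeFd); // pay the flush cost only for the messages that matter most
  } else {
    fastSink.write(entry); // cheap path for routine messages
  }
}
```
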
1

Has anyone ever come up with a way of deciding where your particular situation should lie?

There is one simple question for that: if for any reason logging does not work, do you want to continue your operation?

If your logging is required by some law or business requirement, then I guess there is your answer.

If, on the other hand, your company would rather sell a widget with no debug logs than not sell the widget at all, then there you have an answer too.

Remember, your business does not make money on logs, only on transactions of whatever kind your business is a part of. Logs are a technical detail for you, the developer. If you can, make sure logging does not get in the way of the primary directive.

nvoigt
  • 7,737
  • Hmm, how did I miss your answer all this time? Here, have an upvote! :) This is an interesting approach, I like it. And... to be honest, I don't really have an answer for my particular case! I'll have to think about it. The code is backend code which we run on our server and don't really "sell". But would I be comfortable if it ran without logging? On one hand, if it stops, a lot of people would get pretty unhappy fast. On the other hand, every day we get questions like "why did this happen?" and "why did that happen?", so the logs also get used regularly... – Vilx- Oct 21 '20 at 23:46